In [1]:
import pandas as pd
import numpy as np

from collections import Counter

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.mixture import GaussianMixture

import boto3

import chart_studio
import chart_studio.plotly as py
C:\Users\jcf\AppData\Local\Programs\Python\Python311\Lib\site-packages\sentence_transformers\cross_encoder\CrossEncoder.py:11: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
  from tqdm.autonotebook import tqdm, trange

public data: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset

personal data: spotify API

The objective of this analysis is to get Spotify data for a particular user library and try to cluster the genres in 3 different ways and compare them. As the genre of a song is a partial subjective topic, conclusions could vary from person to person.

There are two datasets: one from the user, got using the Spotify API to retrieve the user's library tracks; the other is a public dataset containing 1.000.000 tracks with mostly the same information as the user data.

User and public data contain information of tracks, including title, artist, duration, and some features used internally in Spotify and assigned to each track, such as danceability, acousticness, etc. User data contains some extra features, such as number of sections, tempo changes, and some more.

The meaning of the features can be seen in the API documentation: https://developer.spotify.com/documentation/web-api/reference/get-track

Functions and constants¶

In [2]:
def read_full_table(table_name):
    session = boto3.Session(profile_name="default")
    dynamodb = session.resource("dynamodb", region_name="eu-west-1")
    table = dynamodb.Table(table_name)
    response = table.scan()
    data = response["Items"]

    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        data.extend(response["Items"])
        
    return pd.DataFrame(data)
In [3]:
numerical_cols = ["num_sections", "danceability", "sections_avg_duration", "instrumentalness", "liveness", "loudness",
                  "duration", "speechiness", "valence", "dynamics_changes", "tempo_changes", "acousticness",
                  "time_signature_changes", "popularity", "mode_changes", "energy", "key_changes", "tempo"]
numerical_cols_public = ["danceability", "instrumentalness", "liveness", "loudness", "duration", "speechiness", 
                         "valence", "acousticness", "popularity", "energy", "tempo"]

non_standarized_cols = ["num_sections", "sections_avg_duration", "loudness", "duration", "dynamics_changes", 
                         "tempo_changes", "time_signature_changes", "popularity", "mode_changes", "key_changes", "tempo"]
categorical_cols = ["key", "mode"]

notes = ("C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B")
key_mapping = {i:note for i, note in enumerate(notes)}
key_mapping[-1] = "NoKey"

mode_mapping = {0: "Minor", 1: "Major"}

random_state = 602452

Preprocess¶

In [4]:
ssm = boto3.client("ssm", region_name="eu-west-1")
chart_studio_api_key = ssm.get_parameter(Name="CHART_STUDIO_API_KEY", WithDecryption=True)
chart_studio_api_key = chart_studio_api_key["Parameter"]["Value"]

chart_studio.tools.set_credentials_file(username='jcf94', api_key=chart_studio_api_key)
In [5]:
track_info_raw = read_full_table("track_info")
public_data_raw = pd.read_csv(r"C:\Users\jcf\Desktop\codigo\Portfolio\Spotify Analysis\public_music_data.csv")
In [6]:
track_info = track_info_raw.copy()
public_data = public_data_raw.copy()
In [7]:
track_info["key"] = track_info["key"].map(key_mapping)
track_info["mode"] = track_info["mode"].map(mode_mapping)

public_data["key"] = public_data["key"].map(key_mapping)
public_data["mode"] = public_data["mode"].map(mode_mapping)
In [8]:
for col in numerical_cols:
    track_info[col] = pd.to_numeric(track_info[col])
    
for col in track_info.columns:
    if "changes" in col:
        track_info[col] = track_info[col] / track_info["duration"]

track_info["case"] = "user"
track_info = track_info.loc[track_info["genres"].str.len()>0, :]

public_data["duration"] = public_data["duration_ms"] / 1000
public_data["case"] = "public"

genre_info = track_info.explode("genres")
genre_info = genre_info.loc[:, ["genres", "track_id", "artist"]]
genre_info = genre_info.groupby(["genres"]).nunique().reset_index()
genre_info["track_perc"] = 100 * genre_info["track_id"] / genre_info["track_id"].sum()
genre_info["artist_perc"] = 100 * genre_info["artist"] / genre_info["artist"].sum()

genre_info_public = public_data.loc[:, ["track_genre", "track_id", "artists"]]
genre_info_public = genre_info_public.groupby(["track_genre"]).nunique().reset_index()
genre_info_public["track_perc"] = 100 * genre_info_public["track_id"] / genre_info_public["track_id"].sum()
genre_info_public["artist_perc"] = 100 * genre_info_public["artists"] / genre_info_public["artists"].sum()
In [9]:
for row in track_info.genres.sample(5):
    print(row)
['alternative metal', 'alternative rock', 'funk metal', 'funk rock', 'grunge', 'nu metal']
['alternative rock', 'funk metal', 'funk rock', 'permanent wave', 'rock']
['album rock', 'british invasion', 'classic rock', 'hard rock', 'rock']
['otacore']
['instrumental math rock', 'instrumental rock', 'math rock']

An example of genres are shown. As it can be seen, a song can have multiple genres.

Clustering¶

EDA¶

First, an Exploratory Data Analysis is done, to see how many genres are in each dataset and in which quantity.

In [10]:
genre_info_plot = genre_info.sort_values(by="track_perc", ascending=False)
genre_info_plot["track_perc_accum"] = genre_info_plot["track_perc"].cumsum()

limit = 90

total_elements = genre_info_plot["genres"].nunique()
top_elements = genre_info_plot.loc[genre_info_plot["track_perc_accum"] <= limit, "genres"].nunique()

print(f"{total_elements=}, top {limit}% elements={top_elements}")

fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(
    go.Scatter(x=genre_info_plot["genres"], y=genre_info_plot["track_perc_accum"], name="% Accum"),
    secondary_y=True,
)
fig.add_trace(
    go.Bar(x=genre_info_plot["genres"], y=genre_info_plot["track_perc"], name="%"),
    secondary_y=False,
)
fig.update_layout(
    title_text="User Track Genres"
)

fig.show()

py.plot(fig, filename="user_genres_distribution", auto_open=False)
total_elements=241, top 90% elements=83
Out[10]:
'https://plotly.com/~jcf94/3/'

The genres are shown for the user data, both by track and by artist (how many songs of a certain genre are in the user library, and how many artists are of a certain genre). Genre data is supported for artist in the Spotify API, and not by track, so for each track, the artist genre is used.

As it can be seen, this user listens mainly to rock and metal, but also some funk, jazz and fusion genres, as well as more alterntive.

There are over 240 different genres, and only 84 of them represent 90% of the total genres in the dataset. For the rest of the analysis, only these 84 genres are taken into account, as they represent the majority of the dataset, and a high number of groups can lead to problems and difficulty in the clustering (most groups are underrepresented, so the results are very mixed groups).

A comparison of the genres in the user data and public data is shown, to see if they are similar or very different. First, the genres in the user data that are not in the public database are shown, as well as the number of them, absolute and in percentage.

Arount 95% of genres in the user data are not in the public database, and around 90% of genres in the public database are not in the user data, so both databases are vastly different in terms of genres. This could create some problems in ML, as the public data does not represent the user data, and any algorithms made from one database could not be applicable to the other.

In [11]:
public_genres = genre_info_public["track_genre"].unique()
user_genres = genre_info["genres"].unique()

user_genres_not_public = set(user_genres).difference(set(public_genres))
pulic_genres_not_user = set(public_genres).difference(set(user_genres))

print(f"{user_genres_not_public=}\n({len(user_genres_not_public)}, {100 * len(user_genres_not_public) / len(user_genres)}%)")
print(f"{pulic_genres_not_user=}\n({len(pulic_genres_not_user)}, {100 * len(pulic_genres_not_user) / len(public_genres)})%")
user_genres_not_public={'electric bass', 'proto-metal', 'acid rock', 'comic', 'art rock', 'polish prog', 'jazz metal', 'comedy rock', 'uk doom metal', 'jazz fusion', 'stoner metal', 'spacegrunge', 'synth prog', 'instrumental rock', 'glam rock', 'progressive jazz fusion', 'rap metal', 'dance-punk', 'laboratorio', 'acoustic rock', 'merseybeat', 'madchester', 'instrumental bluegrass', 'sludge metal', 'louisville indie', 'southern soul', 'progressive death metal', 'abstract', 'funk metal', 'jazz funk', 'taiwan indie', 'british blues', 'indie rock', 'progressive bluegrass', 'permanent wave', 'shoegaze', 'post-grunge', 'classic canadian rock', 'modern progressive rock', 'piano rock', 'el paso indie', 'slacker rock', 'industrial metal', 'doom metal', 'japanese math rock', 'southeast asian post-rock', 'birmingham metal', 'swedish doom metal', 'atlanta metal', 'comic metal', 'cybergrind', 'progressive rock', 'flute rock', 'krautrock', 'avant-garde metal', 'zolo', 'soft rock', 'technical thrash', 'drill and bass', 'indietronica', 'electric blues', 'symphonic rock', 'instrumental djent', 'mexican classic rock', 'oxford indie', 'palm desert scene', 'nu gaze', 'indie catala', 'greek psychedelic rock', 'blues rock', 'nu metal', 'uk post-punk', 'rare groove', 'art pop', 'jam band', 'classic rock', 'experimental indie rock', 'british jazz', 'scottish rock', 'conscious hip hop', 'dream pop', 'afrofuturism', 'ann arbor indie', 'jazztronica', 'modern hard rock', 'political hip hop', 'taiwan post-rock', 'glam metal', 'cosmic american', 'cyberpunk', 'garage rock', 'british math rock', 'melodic thrash', 'electronica', 'hard rock', 'supergroup', 'dance pop', 'funktronica', 'post-metal', 'melbourne punk', 'progressive metalcore', 'progressive sludge', 'sacramento indie', 'stomp and holler', 'instrumental post-rock', 'cascadia psych', 'p funk', 'chapman stick', 'psychedelic rock', 'djent', 'alternative rock', 'alternative dance', 'alternative metal', 'italian progressive rock', 'classic texas country', 'new wave', 'art punk', 'instrumental math rock', 'australian psych', 'japanese vgm', 'metal cearense', 'atmospheric sludge', 'french shoegaze', 'roots rock', 'post-punk', 'australian metal', 'midwest emo', 'dark pop', 'synth funk', 'funk rock', 'german hard rock', 'israeli metal', 'yacht rock', 'anti-folk', 'microtonal', 'japanese post-rock', 'dance rock', 'rock drums', 'contemporary post-bop', 'norwegian prog', 'groove metal', 'german stoner rock', 'shimmer pop', 'texas metal', 'jazz piano', 'experimental pop', 'classic japanese jazz', 'speed metal', 'album rock', 'modern rock', 'post-rock', 'american post-rock', 'psychedelic soul', 'psychobilly', 'intelligent dance music', 'swedish metal', 'sci-fi metal', 'folk rock', 'noise rock', 'rap rock', 'math rock', 'atmospheric post-metal', 'bebop', 'trip hop', 'brazilian progressive metal', 'industrial rock', 'britpop', 'canadian metal', 'german rock', 'post-hardcore', 'prog metal', 'indie jazz', 'space rock', 'mellow gold', 'technical groove metal', 'jazz rock', 'old school thrash', 'no wave', 'crank wave', 'beatlesque', 'emotional black metal', 'sillycore', 'french death metal', 'breakcore', 'otacore', 'video game music', 'progressive metal', 'chamber pop', 'atmospheric black metal', 'country rock', 'modern blues rock', 'french black metal', 'post-black metal', 'nwobhm', 'progressive groove metal', 'canterbury scene', 'baroque', 'modern alternative rock', 'noise pop', 'north carolina metal', 'motown', 'parody', 'instrumental stoner rock', 'neo-psychedelic', 'opera metal', 'melodic metalcore', 'experimental', 'german metal', 'italian baroque', 'electronic djent', 'electronic rock', 'experimental rock', 'thrash metal', 'modern jazz trio', 'metal guitar', 'double drumming', 'french metal', 'shred', 'swedish progressive metal', 'blackgaze', 'classic soul', 'instrumental funk', 'late romantic era', 'stoner rock', 'san diego indie', 'gothic metal', 'boston rock', 'contemporary jazz', 'neo classical metal', 'british invasion', 'melancholia'}
(231, 95.850622406639%)
pulic_genres_not_user={'tango', 'death-metal', 'deep-house', 'kids', 'club', 'j-dance', 'forro', 'power-pop', 'sleep', 'brazil', 'j-pop', 'idm', 'dub', 'sad', 'chill', 'children', 'psych-rock', 'dubstep', 'black-metal', 'malay', 'heavy-metal', 'ambient', 'dance', 'j-rock', 'new-age', 'guitar', 'progressive-house', 'happy', 'alternative', 'honky-tonk', 'turkish', 'k-pop', 'breakbeat', 'mandopop', 'detroit-techno', 'indian', 'piano', 'iranian', 'french', 'folk', 'study', 'grindcore', 'world-music', 'disco', 'garage', 'cantopop', 'country', 'minimal-techno', 'trip-hop', 'swedish', 'hard-rock', 'j-idol', 'anime', 'reggae', 'show-tunes', 'indie', 'r-n-b', 'groove', 'pop', 'opera', 'comedy', 'rockabilly', 'electro', 'rock-n-roll', 'synth-pop', 'edm', 'hardcore', 'spanish', 'blues', 'dancehall', 'bluegrass', 'metalcore', 'sertanejo', 'pagode', 'gospel', 'afrobeat', 'drum-and-bass', 'chicago-house', 'salsa', 'goth', 'songwriter', 'party', 'alt-rock', 'house', 'hardstyle', 'acoustic', 'electronic', 'trance', 'mpb', 'hip-hop', 'punk', 'pop-film', 'punk-rock', 'techno', 'indie-pop', 'latino', 'romance', 'reggaeton', 'latin', 'ska', 'samba', 'german', 'british', 'disney'}
(104, 91.2280701754386)%

By genre text¶

In this section, instead of using the features of each track, genres are tried to be grouped using only the semantic similarity between them. This could group similar genres that share common words (such as a "main" genre and its subgenres: rock, alternative rock, progressive rock, funk rock, etc), but will probably fail to cluster genres that share common traits but no semantic similarity (for example, alt rock and indie, or djent and death metal).

A SentenceTransformer is used with a pre-trained model to convert the string into numerical features, and then a KMeans model is used for clustering.

All unique genres are clustered in 10 groups, and must be manually checked in order to see if they have sense.

A PCA is also applied to draw a simple 2D plot in order to see if the groups have some sense, although it's hard to see multidimensional information reduced to only 2 dimensions, so some info is lost when plotting.

In [12]:
genre_info_sorted = genre_info.sort_values(by=f"track_perc", ascending=False)
genre_info_sorted[f"track_perc_accum"] = genre_info_sorted[f"track_perc"].cumsum()

limit = 90

genre_info_sorted = genre_info_sorted.loc[genre_info_sorted[f"track_perc_accum"] <= limit, "genres"].unique()
genres =  genre_info_sorted.tolist()

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

embeddings = model.encode(genres)

num_clusters = 10

kmeans = KMeans(n_clusters=num_clusters, random_state=random_state)
kmeans.fit(embeddings)
labels = [str(e) for e in kmeans.labels_]

# Print the genres grouped by cluster
clusters = {}
for genre, label in zip(genres, labels):
    if label not in clusters:
        clusters[label] = []
    clusters[label].append(genre)

clusters = dict(sorted(clusters.items()))

for cluster_id, genre_list in clusters.items():
    print(f"Cluster {cluster_id}: {', '.join(genre_list)}\n")

pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings)

fig = px.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], color=labels)
fig.show()

py.plot(fig, filename="PCA_semantic_clustering", auto_open=False)
C:\Users\jcf\AppData\Local\Programs\Python\Python311\Lib\site-packages\huggingface_hub\file_download.py:1132: FutureWarning:

`resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.

C:\Users\jcf\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\externals\loky\backend\context.py:136: UserWarning:

Could not find the number of physical cores for the following reason:
found 0 physical cores < 1
Returning the number of logical cores instead. You can silence this warning by setting LOKY_MAX_CPU_COUNT to the number of cores you want to use.

  File "C:\Users\jcf\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\externals\loky\backend\context.py", line 282, in _count_physical_cores
    raise ValueError(f"found {cpu_count_physical} physical cores < 1")
Cluster 0: nu metal, metal, swedish metal, mellow gold, canadian metal, swedish progressive metal

Cluster 1: funk rock, funk metal, dance pop, instrumental funk, jam band, blues rock, shoegaze, funk, rap rock, p funk

Cluster 2: progressive jazz fusion, jazz, contemporary jazz, jazz rock, jazz fusion, jazz funk, jazz metal

Cluster 3: classic rock, album rock, symphonic rock, instrumental rock, psychedelic rock, el paso indie, oxford indie, classic canadian rock, electric bass, indie rock, instrumental math rock, sacramento indie, singer-songwriter

Cluster 4: rock, alternative rock, permanent wave, progressive rock, art rock, hard rock, modern rock, garage rock, microtonal, soft rock, stoner rock, math rock, palm desert scene, glam rock, new wave, post-rock, acid rock

Cluster 5: melancholia, old school thrash, melodic thrash, technical thrash, noise pop

Cluster 6: alternative metal, progressive metal, groove metal, progressive groove metal, french metal, french death metal, double drumming, rap metal, thrash metal, speed metal, stoner metal, technical groove metal, progressive death metal

Cluster 7: neo-psychedelic, uk post-punk, conscious hip hop, political hip hop, trip hop

Cluster 8: djent, instrumental djent

Cluster 9: grunge, post-grunge, australian psych, zolo, supergroup

Out[12]:
'https://plotly.com/~jcf94/5/'

Cluster 0 contains some metal genres, in particular some regional variants.

Cluster 1 seems to have rythmic genres, mostly funk and related, but also some "jam" genres (where there is room for improvisation), such as jam band and blues.

Cluster 2 includes mainly jazz and variations of jazz.

Cluster 3 is rock and several "classic" subgenres.

Cluster 4 comprises rock, but also some "weird" subgenres, such as microtonal and new wave. These genres tend to be more experimental.

Cluster 5 is mainly thrash and variants, but also "melancholia" and noise pop.

Cluster 6 is more metal subgenres, mainly prog.

Cluster 7 is a mix of hip hop and variations, and neo-psychedelic.

Cluster 8 is basically djent.

Finally, cluster 9 is grunge and some variations of it.

It seems the clustering is sensible, as most of the grouped genres share traits and could be grouped together if the objective is to reduce the number of genres (we went from 84 unique genres to 10).

In order to check if these groups have sense in the real data, we assign the clusters to each genre, and a sample of 10 songs are drawn randomly from each cluster. If the grouping is sensible, these tracks should really have common traits. It should be taken into account the fact that each track can have several genres, so a track can be in more than one cluster. As this multiple genre trait is not quantified (it's not known "how much of a genre" has a certain song), this could be problematic, as a track can be clustered in several groups but not all of them are "correct". How "correct" is a song is also subjective.

In [13]:
def assign_cluster(x):
    clusters_row = []
    for k, v in clusters.items():
        if set(x) & set(v):
            clusters_row.append(k)
            
    return clusters_row


track_info["cluster_NLP"] = track_info["genres"].apply(assign_cluster)
df_exploded = track_info.explode("cluster_NLP")

df_exploded_csv = df_exploded[categorical_cols + numerical_cols + ["cluster_NLP", "genres", "track_name", "artist"]]
df_exploded_csv.to_csv("df_exploded_clustering.csv")

for i in range(0, num_clusters):
    _df = df_exploded.loc[df_exploded["cluster_NLP"] == str(i), ["track_name", "artist", "genres", "track_url"]]
    elements = min(10, _df.shape[0])
    print(f"Cluster {i}")
    print(_df.sample(elements).values)
Cluster 0
[['Over Now' 'Alice In Chains'
  list(['alternative metal', 'alternative rock', 'grunge', 'hard rock', 'nu metal', 'rock'])
  'https://open.spotify.com/track/6uJhCao8wULMrDOZOuS5rc']
 ['Serein' 'Katatonia'
  list(['doom metal', 'gothic metal', 'progressive metal', 'swedish doom metal', 'swedish metal', 'swedish progressive metal'])
  'https://open.spotify.com/track/3ThZfByczQOSRlqLR2FGir']
 ['The Twilight Zone' 'Rush'
  list(['album rock', 'canadian metal', 'classic canadian rock', 'classic rock', 'hard rock', 'progressive rock', 'rock'])
  'https://open.spotify.com/track/0QHY2cPwl7GpgC4WO5Onql']
 ['All Secrets Known' 'Alice In Chains'
  list(['alternative metal', 'alternative rock', 'grunge', 'hard rock', 'nu metal', 'rock'])
  'https://open.spotify.com/track/0vQDhuk73PvmaloRibUiQr']
 ["Mind's Mirrors / In Death - Is Life / In Death - Is Death - Live"
  'Meshuggah'
  list(['alternative metal', 'djent', 'groove metal', 'metal', 'nu metal', 'progressive groove metal', 'swedish metal', 'technical groove metal', 'technical thrash'])
  'https://open.spotify.com/track/7snx3JnD5PxvEJ7SZeArA8']
 ['The Root of All Evil' 'Dream Theater'
  list(['metal', 'progressive metal'])
  'https://open.spotify.com/track/09pKnVI4uEh31Vvz7lVggD']
 ['Vital Signs' 'Rush'
  list(['album rock', 'canadian metal', 'classic canadian rock', 'classic rock', 'hard rock', 'progressive rock', 'rock'])
  'https://open.spotify.com/track/1k0GxoZYv3Yx5mNeXMOZN2']
 ['Falling Away from Me' 'Korn'
  list(['alternative metal', 'funk metal', 'hard rock', 'nu metal', 'post-grunge', 'rap metal', 'rock'])
  'https://open.spotify.com/track/2F6FfZ4w8z3eJpSxPotVO5']
 ['The Czar: Usurper / Escape / Martyr / Spiral' 'Mastodon'
  list(['alternative metal', 'atlanta metal', 'metal', 'progressive groove metal', 'progressive sludge', 'sludge metal', 'stoner metal', 'stoner rock'])
  'https://open.spotify.com/track/2LMjQnKH7sQzOD0l8q6eWz']
 ['Lethargica' 'Meshuggah'
  list(['alternative metal', 'djent', 'groove metal', 'metal', 'nu metal', 'progressive groove metal', 'swedish metal', 'technical groove metal', 'technical thrash'])
  'https://open.spotify.com/track/4FvuNTv7dcQtoByEePExgW']]
Cluster 1
[['One Better' 'Les Claypool' list(['funk metal', 'funk rock'])
  'https://open.spotify.com/track/45wViUjhlClQRK1cqM2ptP']
 ['Open Forum' 'Snarky Puppy'
  list(['contemporary jazz', 'funk rock', 'jazz', 'progressive jazz fusion'])
  'https://open.spotify.com/track/4vRmWT3EkB6clON5FOZmeA']
 ['No One to Depend On' 'Santana'
  list(['blues rock', 'classic rock', 'mexican classic rock'])
  'https://open.spotify.com/track/74lwRyZECS8PQOCYyHKje4']
 ['Black Summer' 'Red Hot Chili Peppers'
  list(['alternative rock', 'funk metal', 'funk rock', 'permanent wave', 'rock'])
  'https://open.spotify.com/track/3a94TbZOxhkI9xuNwYL53b']
 ['Turn It Again' 'Red Hot Chili Peppers'
  list(['alternative rock', 'funk metal', 'funk rock', 'permanent wave', 'rock'])
  'https://open.spotify.com/track/4gJgHqy4BVCIEcGvx0hGLw']
 ['The Sinister Minister' 'Béla Fleck and the Flecktones'
  list(['jam band', 'progressive bluegrass'])
  'https://open.spotify.com/track/2jWuNKBlgkfb3M0WDKexY8']
 ['Bag of Grins' 'Red Hot Chili Peppers'
  list(['alternative rock', 'funk metal', 'funk rock', 'permanent wave', 'rock'])
  'https://open.spotify.com/track/1TmBaKkDUu0akM9xzSxRia']
 ['Jefe' 'Snarky Puppy'
  list(['contemporary jazz', 'funk rock', 'jazz', 'progressive jazz fusion'])
  'https://open.spotify.com/track/5qxc0Nw40J7m6mAmosTxwP']
 ["Doin' It" 'Herbie Hancock'
  list(['contemporary post-bop', 'instrumental funk', 'jazz', 'jazz funk', 'jazz fusion', 'jazz piano'])
  'https://open.spotify.com/track/3qQVUOHJdgIFWJd0jrG9GE']
 ['We Believe' 'Red Hot Chili Peppers'
  list(['alternative rock', 'funk metal', 'funk rock', 'permanent wave', 'rock'])
  'https://open.spotify.com/track/3Or10XF8LCimAlD8k4TmCn']]
Cluster 2
[['Teen Town' 'Weather Report'
  list(['bebop', 'contemporary post-bop', 'electric bass', 'jazz', 'jazz funk', 'jazz fusion'])
  'https://open.spotify.com/track/4OzXE9NnSdD9aEAwBcnYBI']
 ['Fracture' 'King Crimson'
  list(['art rock', 'instrumental rock', 'jazz rock', 'progressive rock', 'psychedelic rock', 'symphonic rock', 'zolo'])
  'https://open.spotify.com/track/5ipS4tCSxn0z7NL6To7umt']
 ['The Great Deceiver' 'King Crimson'
  list(['art rock', 'instrumental rock', 'jazz rock', 'progressive rock', 'psychedelic rock', 'symphonic rock', 'zolo'])
  'https://open.spotify.com/track/62exguNzjyjvZWVhctFRq4']
 ['Sleepless' 'King Crimson'
  list(['art rock', 'instrumental rock', 'jazz rock', 'progressive rock', 'psychedelic rock', 'symphonic rock', 'zolo'])
  'https://open.spotify.com/track/0JwoMuwai5esGaMOkEAXCF']
 ['Flight - Live From Dordrecht, Het Energiehuis / 2014' 'Snarky Puppy'
  list(['contemporary jazz', 'funk rock', 'jazz', 'progressive jazz fusion'])
  'https://open.spotify.com/track/7eum7FacBgs9ZwUdt1r9g8']
 ['Come On, Come Over' 'Jaco Pastorius'
  list(['contemporary post-bop', 'electric bass', 'jazz', 'jazz funk', 'jazz fusion'])
  'https://open.spotify.com/track/4RflGBhUpxjrs2fMJD0QVX']
 ['Elephant Talk' 'King Crimson'
  list(['art rock', 'instrumental rock', 'jazz rock', 'progressive rock', 'psychedelic rock', 'symphonic rock', 'zolo'])
  'https://open.spotify.com/track/1VeYMKim09aEymk9grhXRf']
 ['The Curtain - Live From Dordrecht, Het Energiehuis / 2014'
  'Snarky Puppy'
  list(['contemporary jazz', 'funk rock', 'jazz', 'progressive jazz fusion'])
  'https://open.spotify.com/track/29ls98FgNdZHbmqdQeF7E6']
 ['125th Street Congress' 'Weather Report'
  list(['bebop', 'contemporary post-bop', 'electric bass', 'jazz', 'jazz funk', 'jazz fusion'])
  'https://open.spotify.com/track/7qJkuqvDhyD20D1JCK2Aqy']
 ['Home' 'GoGo Penguin'
  list(['british jazz', 'contemporary jazz', 'indie jazz', 'jazztronica', 'modern jazz trio', 'progressive jazz fusion'])
  'https://open.spotify.com/track/0NLvWEG2HR5dlxweMSRRHk']]
Cluster 3
[['Vermicide' 'The Mars Volta' list(['el paso indie', 'garage rock'])
  'https://open.spotify.com/track/31WBIMEaJP63p6HtLKJhag']
 ['Talk Show Host' 'Radiohead'
  list(['alternative rock', 'art rock', 'melancholia', 'oxford indie', 'permanent wave', 'rock'])
  'https://open.spotify.com/track/3cMuGOGSaTWbwOurTS4b3Y']
 ['Across The Universe - Remastered 2009' 'The Beatles'
  list(['british invasion', 'classic rock', 'merseybeat', 'psychedelic rock', 'rock'])
  'https://open.spotify.com/track/4dkoqJrP0L8FXftrMZongF']
 ['Cassandra Gemini: Tarantism' 'The Mars Volta'
  list(['el paso indie', 'garage rock'])
  'https://open.spotify.com/track/6cOME6f17Y1f0wr21aTXpZ']
 ['Frame By Frame' 'King Crimson'
  list(['art rock', 'instrumental rock', 'jazz rock', 'progressive rock', 'psychedelic rock', 'symphonic rock', 'zolo'])
  'https://open.spotify.com/track/0yg93GXlS0ZmLsFpXG5bT2']
 ['War Pigs' 'Black Sabbath'
  list(['album rock', 'alternative metal', 'birmingham metal', 'classic rock', 'hard rock', 'metal', 'rock', 'stoner rock', 'uk doom metal'])
  'https://open.spotify.com/track/0W35nxtHtFlseSojmygEsf']
 ['Siberian Khatru - 2003 Remaster' 'Yes'
  list(['album rock', 'art rock', 'classic rock', 'hard rock', 'progressive rock', 'rock', 'soft rock', 'symphonic rock'])
  'https://open.spotify.com/track/1nyLujWRDFnsuKkz1Iq387']
 ['An Infinite Regression' 'Animals As Leaders'
  list(['djent', 'instrumental djent', 'instrumental rock', 'jazz metal', 'progressive jazz fusion', 'progressive metal'])
  'https://open.spotify.com/track/6TusaYvpkgyJhJ8tvXo03P']
 ['Sextape' 'Deftones'
  list(['alternative metal', 'nu metal', 'rap metal', 'rock', 'sacramento indie'])
  'https://open.spotify.com/track/1EryAkZ0VHstC6haIxVBiE']
 ['Dancing With The Moonlit Knight - Remastered 2008' 'Genesis'
  list(['album rock', 'art rock', 'classic rock', 'hard rock', 'mellow gold', 'progressive rock', 'rock', 'soft rock', 'symphonic rock'])
  'https://open.spotify.com/track/75n6R38rfp87ElycXr7OJq']]
Cluster 4
[['Sour Times - Live' 'Portishead'
  list(['alternative rock', 'art pop', 'dark pop', 'electronica', 'laboratorio', 'trip hop'])
  'https://open.spotify.com/track/5IdRnW0NJdKBolaTmMF7Ux']
 ['Cassandra Gemini: Pisacis (Phra-men-ma)' 'The Mars Volta'
  list(['el paso indie', 'garage rock'])
  'https://open.spotify.com/track/7KhhjECPeVmsu8CwDHQF0M']
 ['Mayonaise - 2011 Remaster' 'The Smashing Pumpkins'
  list(['alternative metal', 'alternative rock', 'grunge', 'permanent wave', 'rock', 'spacegrunge'])
  'https://open.spotify.com/track/0jmKzJmUEKNbC7eU8YfOiA']
 ['Love Trilogy' 'Red Hot Chili Peppers'
  list(['alternative rock', 'funk metal', 'funk rock', 'permanent wave', 'rock'])
  'https://open.spotify.com/track/0JCJoVgkgiPvC5hMgdCoJO']
 ['God Is Calling Me Back Home' 'King Gizzard & The Lizard Wizard'
  list(['australian psych', 'double drumming', 'microtonal', 'neo-psychedelic'])
  'https://open.spotify.com/track/5EQzn8CEclXUd2pNidVRG5']
 ['Monarchy of Roses' 'Red Hot Chili Peppers'
  list(['alternative rock', 'funk metal', 'funk rock', 'permanent wave', 'rock'])
  'https://open.spotify.com/track/16Bf9uR4PI2dCoXDIjs0cP']
 ['The Logical Song - Remastered 2010' 'Supertramp'
  list(['album rock', 'art rock', 'classic rock', 'glam rock', 'mellow gold', 'piano rock', 'progressive rock', 'rock', 'soft rock', 'symphonic rock'])
  'https://open.spotify.com/track/6mHOcVtsHLMuesJkswc0GZ']
 ['Phantom Bride' 'Deftones'
  list(['alternative metal', 'nu metal', 'rap metal', 'rock', 'sacramento indie'])
  'https://open.spotify.com/track/33qrQEXQJg4uk6k8fZgoOa']
 ['Empty Vessels Make the Loudest Sound' 'The Mars Volta'
  list(['el paso indie', 'garage rock'])
  'https://open.spotify.com/track/71BvPd5131rZJCsfJgU8Vu']
 ['15 Step' 'Radiohead'
  list(['alternative rock', 'art rock', 'melancholia', 'oxford indie', 'permanent wave', 'rock'])
  'https://open.spotify.com/track/4oXg7xT4ksBxHTx8PcmSXw']]
Cluster 5
[['Hangar 18' 'Megadeth'
  list(['alternative metal', 'hard rock', 'melodic thrash', 'metal', 'old school thrash', 'rock', 'speed metal', 'thrash metal'])
  'https://open.spotify.com/track/0KAaslGdPc5I6WxmKe3whe']
 ['Paranoid Android' 'Radiohead'
  list(['alternative rock', 'art rock', 'melancholia', 'oxford indie', 'permanent wave', 'rock'])
  'https://open.spotify.com/track/6LgJvl0Xdtc73RJ1mmpotq']
 ['Domination' 'Pantera'
  list(['alternative metal', 'groove metal', 'hard rock', 'metal', 'nu metal', 'old school thrash', 'rock', 'texas metal'])
  'https://open.spotify.com/track/769cLRTw2y6KRdkFWFkxtu']
 ['Eternal Life' 'Jeff Buckley'
  list(['melancholia', 'permanent wave', 'singer-songwriter'])
  'https://open.spotify.com/track/7bf4nfz09yp6w7L7r9hQ1V']
 ['Floods' 'Pantera'
  list(['alternative metal', 'groove metal', 'hard rock', 'metal', 'nu metal', 'old school thrash', 'rock', 'texas metal'])
  'https://open.spotify.com/track/1L2ZkXbRX00ZiaUDuMMgf7']
 ['Poisonous Shadows' 'Megadeth'
  list(['alternative metal', 'hard rock', 'melodic thrash', 'metal', 'old school thrash', 'rock', 'speed metal', 'thrash metal'])
  'https://open.spotify.com/track/1RdDBpGJDIsJTBou1QsJ9B']
 ['Peace Sells - 2004 Remaster' 'Megadeth'
  list(['alternative metal', 'hard rock', 'melodic thrash', 'metal', 'old school thrash', 'rock', 'speed metal', 'thrash metal'])
  'https://open.spotify.com/track/3090goAxG6IlpCifA8m9xB']
 ['Vitamin C' 'CAN'
  list(['experimental', 'experimental rock', 'krautrock', 'neo-psychedelic', 'no wave', 'noise pop', 'post-punk', 'space rock'])
  'https://open.spotify.com/track/0N9w2k0qrAYDHUyliycGD5']
 ['Nude' 'Radiohead'
  list(['alternative rock', 'art rock', 'melancholia', 'oxford indie', 'permanent wave', 'rock'])
  'https://open.spotify.com/track/5k7VKj1Xwy5DjO4B0PdAOb']
 ['There, There' 'Radiohead'
  list(['alternative rock', 'art rock', 'melancholia', 'oxford indie', 'permanent wave', 'rock'])
  'https://open.spotify.com/track/5h4y42RUKwYKYWgutNwvKP']]
Cluster 6
[['Alluda Majaka' 'King Gizzard & The Lizard Wizard'
  list(['australian psych', 'double drumming', 'microtonal', 'neo-psychedelic'])
  'https://open.spotify.com/track/03N0io094N0ZbVnFyti420']
 ['Nuclear Fusion' 'King Gizzard & The Lizard Wizard'
  list(['australian psych', 'double drumming', 'microtonal', 'neo-psychedelic'])
  'https://open.spotify.com/track/5iGPF5Vg43aZ2QTXslY6VA']
 ['The Holy Drinker' 'Steven Wilson'
  list(['progressive metal', 'progressive rock'])
  'https://open.spotify.com/track/6qfovhuEiazhpsIe4UqirU']
 ['Kalamazoo' 'Primus'
  list(['alternative metal', 'alternative rock', 'funk metal', 'funk rock', 'grunge', 'nu metal'])
  'https://open.spotify.com/track/0n3qQAR9kYNLDHV06vO3dD']
 ['Acceptance - Concealing Fate, Pt. 1' 'TesseracT'
  list(['djent', 'progressive metal'])
  'https://open.spotify.com/track/6zqrAN3xSWrTMTUE4vjgZB']
 ['Rosemary' 'Deftones'
  list(['alternative metal', 'nu metal', 'rap metal', 'rock', 'sacramento indie'])
  'https://open.spotify.com/track/4FEr6dIdH6EqLKR0jB560J']
 ['Phantoms' 'Meshuggah'
  list(['alternative metal', 'djent', 'groove metal', 'metal', 'nu metal', 'progressive groove metal', 'swedish metal', 'technical groove metal', 'technical thrash'])
  'https://open.spotify.com/track/3JnLFF2Zb6oOWf7l3gX4Zi']
 ['Slow Jam 1' 'King Gizzard & The Lizard Wizard'
  list(['australian psych', 'double drumming', 'microtonal', 'neo-psychedelic'])
  'https://open.spotify.com/track/5toJ4cpSm8EiAGaJFBmROG']
 ['Behind The Sun' 'Meshuggah'
  list(['alternative metal', 'djent', 'groove metal', 'metal', 'nu metal', 'progressive groove metal', 'swedish metal', 'technical groove metal', 'technical thrash'])
  'https://open.spotify.com/track/0ief86zDRCbcotaaYREsUN']
 ['Bleak' 'Opeth'
  list(['alternative metal', 'metal', 'progressive death metal', 'progressive metal', 'swedish metal', 'swedish progressive metal'])
  'https://open.spotify.com/track/0Nj4eoThAEPpmeSA6zWYEs']]
Cluster 7
[['Blue Morpho' 'King Gizzard & The Lizard Wizard'
  list(['australian psych', 'double drumming', 'microtonal', 'neo-psychedelic'])
  'https://open.spotify.com/track/0sajVraFsiilMTL1XkCDhJ']
 ['Plainsong - Remastered' 'The Cure'
  list(['new wave', 'permanent wave', 'rock', 'uk post-punk'])
  'https://open.spotify.com/track/4gcfxHL1iRgP0RHCDYMNIo']
 ['The Lord of Lightning' 'King Gizzard & The Lizard Wizard'
  list(['australian psych', 'double drumming', 'microtonal', 'neo-psychedelic'])
  'https://open.spotify.com/track/3FaZ3CRwOapV6ngeEpcrpO']
 ['Crumbling Castle' 'King Gizzard & The Lizard Wizard'
  list(['australian psych', 'double drumming', 'microtonal', 'neo-psychedelic'])
  'https://open.spotify.com/track/0NDidThJQ6nbftCXAbHjp5']
 ['Open Water' 'King Gizzard & The Lizard Wizard'
  list(['australian psych', 'double drumming', 'microtonal', 'neo-psychedelic'])
  'https://open.spotify.com/track/0odQCRSjtsGaRCkAgeeN1D']
 ['Faith' 'The Cure'
  list(['new wave', 'permanent wave', 'rock', 'uk post-punk'])
  'https://open.spotify.com/track/01Vrp6Y0VGOQALAshPxeGY']
 ['Magic Doors' 'Portishead'
  list(['alternative rock', 'art pop', 'dark pop', 'electronica', 'laboratorio', 'trip hop'])
  'https://open.spotify.com/track/7mGvICZ9CU9zpEnAyXEDwn']
 ['Perihelion' 'King Gizzard & The Lizard Wizard'
  list(['australian psych', 'double drumming', 'microtonal', 'neo-psychedelic'])
  'https://open.spotify.com/track/1ZfJ6mlKI1nVb7h7Imdksa']
 ['Ontology' 'King Gizzard & The Lizard Wizard'
  list(['australian psych', 'double drumming', 'microtonal', 'neo-psychedelic'])
  'https://open.spotify.com/track/5Pys5Lcouq7xrXha5rgC2i']
 ['K.G.L.W.' 'King Gizzard & The Lizard Wizard'
  list(['australian psych', 'double drumming', 'microtonal', 'neo-psychedelic'])
  'https://open.spotify.com/track/4QugdF1lazluZvplbfVjCR']]
Cluster 8
[['Red Miso' 'Animals As Leaders'
  list(['djent', 'instrumental djent', 'instrumental rock', 'jazz metal', 'progressive jazz fusion', 'progressive metal'])
  'https://open.spotify.com/track/2qJpcZnMUQNOdAPxsuhFFe']
 ['Kascade' 'Animals As Leaders'
  list(['djent', 'instrumental djent', 'instrumental rock', 'jazz metal', 'progressive jazz fusion', 'progressive metal'])
  'https://open.spotify.com/track/7hY2Kc7Hvu0BudOoQwu8Ez']
 ["Mind's Mirrors / In Death - Is Life / In Death - Is Death - Live"
  'Meshuggah'
  list(['alternative metal', 'djent', 'groove metal', 'metal', 'nu metal', 'progressive groove metal', 'swedish metal', 'technical groove metal', 'technical thrash'])
  'https://open.spotify.com/track/7snx3JnD5PxvEJ7SZeArA8']
 ['I Am Colossus' 'Meshuggah'
  list(['alternative metal', 'djent', 'groove metal', 'metal', 'nu metal', 'progressive groove metal', 'swedish metal', 'technical groove metal', 'technical thrash'])
  'https://open.spotify.com/track/3doXY6Mbglh7gPSV0d3eut']
 ['Gordian Naught' 'Animals As Leaders'
  list(['djent', 'instrumental djent', 'instrumental rock', 'jazz metal', 'progressive jazz fusion', 'progressive metal'])
  'https://open.spotify.com/track/7uhwNvGV8LaWoHsrawt6jD']
 ['Of Mind - Nocturne' 'TesseracT' list(['djent', 'progressive metal'])
  'https://open.spotify.com/track/7khQrdng5Zj4bkpS7sQr17']
 ['Wasteland' 'Rendezvous Point'
  list(['djent', 'progressive metal', 'sci-fi metal'])
  'https://open.spotify.com/track/2nJcjZeBQmghc6uiqeBKrx']
 ['Psychosphere' 'Periphery'
  list(['djent', 'melodic metalcore', 'progressive metal', 'progressive metalcore'])
  'https://open.spotify.com/track/5Lvy7YsyBbDBYbjCnfZ2SQ']
 ['Of Matter - Proxy' 'TesseracT' list(['djent', 'progressive metal'])
  'https://open.spotify.com/track/6CfwYtEnSDgMxosLZ5Vbiu']
 ['Beneath My Skin / Mirror Image' 'TesseracT'
  list(['djent', 'progressive metal'])
  'https://open.spotify.com/track/0pfuQHU0YfhmaHJ99W9lDb']]
Cluster 9
[["I'm Not In Your Mind" 'King Gizzard & The Lizard Wizard'
  list(['australian psych', 'double drumming', 'microtonal', 'neo-psychedelic'])
  'https://open.spotify.com/track/2yAST0CIUT6LBTobAcONLP']
 ['Culling Voices' 'TOOL'
  list(['alternative metal', 'art rock', 'nu metal', 'post-grunge', 'progressive metal', 'progressive rock', 'rock'])
  'https://open.spotify.com/track/3gPxMQWDMSEyPXMtzbcDdQ']
 ['Crumbling Castle' 'King Gizzard & The Lizard Wizard'
  list(['australian psych', 'double drumming', 'microtonal', 'neo-psychedelic'])
  'https://open.spotify.com/track/0NDidThJQ6nbftCXAbHjp5']
 ['Starless' 'King Crimson'
  list(['art rock', 'instrumental rock', 'jazz rock', 'progressive rock', 'psychedelic rock', 'symphonic rock', 'zolo'])
  'https://open.spotify.com/track/1Kt1j54YhvP39PnSQjU8H3']
 ['Hooker With A Penis' 'TOOL'
  list(['alternative metal', 'art rock', 'nu metal', 'post-grunge', 'progressive metal', 'progressive rock', 'rock'])
  'https://open.spotify.com/track/3S4G4SL15Cp4CvAfmye8um']
 ['My Hero' 'Foo Fighters'
  list(['alternative metal', 'alternative rock', 'modern rock', 'permanent wave', 'post-grunge', 'rock'])
  'https://open.spotify.com/track/4dVbhS6OiYvFikshyaQaCN']
 ['Scentless Apprentice' 'Nirvana'
  list(['grunge', 'permanent wave', 'rock'])
  'https://open.spotify.com/track/54UFDHWI2q7WHfrGbSNWph']
 ['Freak On a Leash' 'Korn'
  list(['alternative metal', 'funk metal', 'hard rock', 'nu metal', 'post-grunge', 'rap metal', 'rock'])
  'https://open.spotify.com/track/6W21LNLz9Sw7sUSNWMSHRu']
 ['Spegetti Western' 'Primus'
  list(['alternative metal', 'alternative rock', 'funk metal', 'funk rock', 'grunge', 'nu metal'])
  'https://open.spotify.com/track/4kK5dappKa8vYHS5rjrPvu']
 ['All My Life' 'Foo Fighters'
  list(['alternative metal', 'alternative rock', 'modern rock', 'permanent wave', 'post-grunge', 'rock'])
  'https://open.spotify.com/track/6tsojOQ5wHaIjKqIryLZK6']]

Cluster 0 seems somewhat strange at first, as most of it seems mixed metal (and one Rush song, which is definetely NOT metal), but it could be sensible to group most of these tracks.

Cluster 1 seems more or less ok, most of it comprises songs with a rhytmic element very present, but there are some tracks where rythm is not the main focus.

Cluster 2 also seems ok, with mostly jazz.

Cluster 3 is ok, and being one of the most "big" groups (cluster with several elements in it), it should comprise several styles that could be viewed as "general rock".

Cluster 4 also seems mostly ok, although being one with more "weird" genres, it's also one of the most open to interpretation.

Cluster 5 seems a little weird, as it has thrash and similar genres, but also has more soft alt rock, like Radiohead.

Cluster 6 has mostly prog.

Cluster 7 is a strange mix, but the songs seem to correspond to the cluster.

Cluster 8 seems ok also, with mostly djent and similar proggish metal.

Finally, cluster 9 seems a little hit-or-miss, with most of the songs not really corresponding to the expectations.

In conclusion, this approach is the most simple, and gives somewhat sensible results. Some tracks may be off in some cluster, and one of the main causes of this problem is the fact that genre is specified for the whole artist, so if an artist has diversity in genres, it may have songs in some clusters that don't seem ok, but the whole artist could be categorized in that way.

In [14]:
for col in categorical_cols:    
    fig = go.Figure()
    for i in range(0, 10):
        _df = df_exploded.loc[df_exploded["cluster_NLP"] == str(i), :]
        fig.add_trace(go.Histogram(x=_df[col], histnorm="percent", name=f"Cluster {i}"))
        
    fig.update_layout(
        xaxis_title_text=col,
        yaxis_title_text='%'
    )
    fig.update_traces(opacity=0.75)
    fig.show()
In [15]:
for col in numerical_cols:
    if col in track_info.columns and col in public_data.columns:
        fig = go.Figure()
        
        for i in range(0, 10):
            _df = df_exploded.loc[df_exploded["cluster_NLP"] == str(i), :]
            bin_start = _df[col].min()
            bin_end = _df[col].max()
            bin_size = abs(bin_end - bin_start) / 20
            fig.add_trace(go.Histogram(x=_df[col], histnorm="percent", name=f"Cluster {i}"))

        fig.update_layout(
            barmode="overlay",
            xaxis_title_text=col,
            yaxis_title_text='%'
        )
        fig.update_traces(opacity=0.75)
        fig.show()

By features¶

In the next section, the actual features of the tracks are used to try to construct clusters.

First, an EDA is done, to see correlations between numerical features and select those who are not correlated, and construct new features based on combinations of some. The objective is to have a final set of uncorrelated features and use them as input of the clustering algorithm.

In [16]:
track_info["genres"] = track_info["genres"].apply(lambda x: list(set(x) & set(genres)))
track_info = track_info.loc[track_info["genres"].str.len() != 0]

correlations = track_info[numerical_cols].corr().abs()

fig = px.imshow(correlations, text_auto=True)
fig.update_layout(width=1300, height=1200, autosize=False)

fig.show()

Most of the features seem uncorrelated, but some of them have correlation. Number of sections and loudnesd are correlated with several other features, so these are dropped out completely. The following pairs are correlated between them, so for each one, a new feature is calculated multiplying the two:

- Valence and danceability
- Energy and acousticness
- Energy and speechiness
- Key changes and mode changes
In [17]:
track_info_new_features = track_info.copy()

useless_cols = ["num_sections", "loudness"] 
numerical_cols_extra = list(set(numerical_cols).difference(useless_cols))

track_info_new_features = track_info_new_features[track_info_new_features.columns.difference(useless_cols)]

for pair in (("valence", "danceability"), ("energy", "acousticness"), ("energy", "speechiness"), ("key_changes", "mode_changes")):
    track_info_new_features[f"{pair[0]} - {pair[1]}"] = track_info_new_features[pair[0]] * track_info_new_features[pair[1]]
    numerical_cols_extra.append(f"{pair[0]} - {pair[1]}")

correlations = track_info_new_features[numerical_cols_extra].corr().abs()

fig = px.imshow(correlations, text_auto=True)
fig.update_layout(width=1300, height=1200, autosize=False)

fig.show()

Mode changes and key changes seem correlated, but plotting them, it doesn't seem to exist a direct and clear relation between them, so for these, instead of constructing a new feature, they are used as they were originally.

(values for these features are normalized with the song duration)

In [18]:
track_info.loc[:, ["mode_changes", "key_changes"]].plot(kind="scatter", y="mode_changes", x="key_changes")
Out[18]:
<Axes: xlabel='key_changes', ylabel='mode_changes'>

The final correlation matrix is shown, where it can be seen that almost of the features are uncorrelated between them, so they are good candidated for clustering.

In [30]:
useless_cols = ["num_sections", "loudness", "energy", "acousticness", "valence", "danceability", "speechiness"] 
numerical_cols_extra = list(set(numerical_cols).difference(useless_cols))
numerical_cols_extra = numerical_cols_extra + ["valence - danceability", "energy - acousticness", "energy - speechiness"]

track_info_new_features = track_info_new_features[track_info_new_features.columns.difference(useless_cols)]

correlations = track_info_new_features[numerical_cols_extra].corr().abs()

fig = px.imshow(correlations, text_auto=True)

fig.update(layout_coloraxis_showscale=False)

fig.show()

py.plot(fig, filename="correlations_clean_clustering", auto_open=False)
Out[30]:
'https://plotly.com/~jcf94/7/'

These features are scaled, and the categorical features are also added and encoded using one-hot enconding.

In [20]:
track_info_new_features = track_info.copy()
useless_cols = ["num_sections", "loudness", "energy", "acousticness", "valence", "danceability", "speechiness"] 
combinations_cols = [("valence", "danceability"), ("energy", "acousticness"), ("energy", "speechiness")]
numerical_cols_extra = list(set(numerical_cols).difference(useless_cols))

for pair in combinations_cols:
    track_info_new_features[f"{pair[0]} - {pair[1]}"] = track_info_new_features[pair[0]] * track_info_new_features[pair[1]]
    numerical_cols_extra.append(f"{pair[0]} - {pair[1]}")
    
numerical_cols_extra = numerical_cols_extra + categorical_cols + [f"{pair[0]} - {pair[1]}" for pair in combinations_cols]

X = track_info_new_features[numerical_cols_extra]
X = pd.get_dummies(X, columns=categorical_cols)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

KMeans is applied to the features to try to cluster them. 15 groups are used.

In [21]:
kmeans = KMeans(n_clusters=10, random_state=random_state)
kmeans.fit(X)
labels = kmeans.labels_

track_info_labeled = track_info_new_features.copy()
track_info_labeled["cluster"] = labels

genres = track_info_new_features["genres"]

# Print the genres grouped by cluster
clusters = {}
for genres, label in zip(genres, labels):
    if label not in clusters:
        clusters[label] = []
    clusters[label].extend(genres)

for k, v in clusters.items():
    clusters[k] = Counter(v)

clusters = dict(sorted(clusters.items()))

for cluster_id, genre_list in clusters.items():
    print(f"Cluster {cluster_id} (length: {len(genre_list)}): {genre_list}\n")
Cluster 0 (length: 57): Counter({'progressive metal': 27, 'progressive rock': 26, 'rock': 25, 'art rock': 18, 'hard rock': 17, 'classic rock': 16, 'album rock': 16, 'symphonic rock': 13, 'metal': 12, 'soft rock': 10, 'alternative metal': 9, 'funk rock': 8, 'nu metal': 8, 'jazz': 8, 'progressive jazz fusion': 6, 'contemporary jazz': 6, 'mellow gold': 6, 'djent': 6, 'post-grunge': 5, 'double drumming': 4, 'australian psych': 4, 'neo-psychedelic': 4, 'microtonal': 4, 'jam band': 3, 'psychedelic rock': 3, 'canadian metal': 3, 'classic canadian rock': 3, 'thrash metal': 3, 'old school thrash': 3, 'french metal': 3, 'el paso indie': 3, 'garage rock': 3, 'instrumental rock': 2, 'alternative rock': 2, 'dance pop': 2, 'french death metal': 2, 'groove metal': 2, 'progressive groove metal': 2, 'melancholia': 2, 'indie rock': 2, 'jazz funk': 2, 'jazz fusion': 2, 'funk': 1, 'p funk': 1, 'jazz rock': 1, 'zolo': 1, 'funk metal': 1, 'grunge': 1, 'post-rock': 1, 'shoegaze': 1, 'instrumental funk': 1, 'electric bass': 1, 'progressive death metal': 1, 'swedish metal': 1, 'swedish progressive metal': 1, 'permanent wave': 1, 'modern rock': 1})

Cluster 1 (length: 80): Counter({'rock': 40, 'alternative rock': 25, 'alternative metal': 21, 'progressive metal': 18, 'permanent wave': 18, 'hard rock': 17, 'progressive rock': 15, 'nu metal': 13, 'art rock': 12, 'classic rock': 11, 'instrumental rock': 11, 'djent': 11, 'metal': 11, 'modern rock': 10, 'album rock': 9, 'symphonic rock': 8, 'funk metal': 7, 'psychedelic rock': 7, 'funk rock': 6, 'old school thrash': 6, 'grunge': 5, 'jazz rock': 5, 'zolo': 5, 'post-grunge': 5, 'el paso indie': 5, 'groove metal': 5, 'melodic thrash': 5, 'thrash metal': 5, 'speed metal': 5, 'progressive jazz fusion': 5, 'canadian metal': 4, 'classic canadian rock': 4, 'progressive groove metal': 4, 'instrumental djent': 4, 'jazz metal': 4, 'rap rock': 3, 'conscious hip hop': 3, 'rap metal': 3, 'political hip hop': 3, 'garage rock': 3, 'swedish metal': 3, 'supergroup': 3, 'mellow gold': 3, 'dance pop': 3, 'melancholia': 3, 'math rock': 3, 'indie rock': 3, 'stoner rock': 2, 'palm desert scene': 2, 'trip hop': 2, 'microtonal': 2, 'technical groove metal': 2, 'technical thrash': 2, 'glam rock': 2, 'funk': 2, 'p funk': 2, 'swedish progressive metal': 2, 'french death metal': 2, 'french metal': 2, 'singer-songwriter': 2, 'blues rock': 2, 'soft rock': 2, 'jazz': 2, 'instrumental math rock': 2, 'jam band': 2, 'shoegaze': 2, 'stoner metal': 1, 'double drumming': 1, 'australian psych': 1, 'neo-psychedelic': 1, 'acid rock': 1, 'contemporary jazz': 1, 'oxford indie': 1, 'new wave': 1, 'uk post-punk': 1, 'electric bass': 1, 'instrumental funk': 1, 'jazz funk': 1, 'jazz fusion': 1, 'noise pop': 1})

Cluster 2 (length: 21): Counter({'progressive metal': 5, 'progressive rock': 3, 'symphonic rock': 3, 'metal': 2, 'classic rock': 2, 'hard rock': 2, 'art rock': 2, 'mellow gold': 2, 'album rock': 2, 'soft rock': 2, 'rock': 1, 'singer-songwriter': 1, 'blues rock': 1, 'djent': 1, 'alternative metal': 1, 'groove metal': 1, 'progressive groove metal': 1, 'technical groove metal': 1, 'swedish metal': 1, 'technical thrash': 1, 'nu metal': 1})

Cluster 3 (length: 49): Counter({'progressive rock': 28, 'art rock': 21, 'rock': 21, 'progressive metal': 21, 'classic rock': 16, 'symphonic rock': 16, 'album rock': 15, 'hard rock': 11, 'alternative metal': 11, 'metal': 9, 'psychedelic rock': 9, 'soft rock': 7, 'nu metal': 6, 'post-grunge': 5, 'instrumental rock': 5, 'jazz rock': 5, 'zolo': 5, 'funk rock': 4, 'progressive death metal': 4, 'swedish metal': 4, 'swedish progressive metal': 4, 'el paso indie': 4, 'garage rock': 4, 'jazz': 3, 'canadian metal': 3, 'classic canadian rock': 3, 'mellow gold': 3, 'djent': 2, 'progressive jazz fusion': 2, 'contemporary jazz': 2, 'double drumming': 2, 'australian psych': 2, 'neo-psychedelic': 2, 'microtonal': 2, 'alternative rock': 1, 'funk metal': 1, 'grunge': 1, 'stoner rock': 1, 'stoner metal': 1, 'progressive groove metal': 1, 'thrash metal': 1, 'old school thrash': 1, 'jazz funk': 1, 'electric bass': 1, 'jazz fusion': 1, 'blues rock': 1, 'glam rock': 1, 'funk': 1, 'p funk': 1})

Cluster 4 (length: 71): Counter({'rock': 56, 'alternative rock': 41, 'permanent wave': 26, 'alternative metal': 24, 'funk rock': 23, 'funk metal': 21, 'classic rock': 17, 'album rock': 17, 'hard rock': 17, 'nu metal': 17, 'grunge': 13, 'psychedelic rock': 11, 'art rock': 10, 'progressive rock': 10, 'instrumental rock': 9, 'modern rock': 8, 'symphonic rock': 7, 'metal': 7, 'neo-psychedelic': 7, 'microtonal': 7, 'acid rock': 6, 'progressive metal': 6, 'double drumming': 6, 'australian psych': 6, 'rap metal': 6, 'french metal': 5, 'el paso indie': 5, 'instrumental funk': 5, 'noise pop': 5, 'post-grunge': 5, 'djent': 4, 'garage rock': 4, 'oxford indie': 4, 'melancholia': 4, 'shoegaze': 4, 'mellow gold': 3, 'soft rock': 3, 'french death metal': 3, 'groove metal': 3, 'progressive groove metal': 3, 'jazz': 3, 'jazz fusion': 3, 'instrumental math rock': 3, 'canadian metal': 2, 'classic canadian rock': 2, 'stoner rock': 2, 'stoner metal': 2, 'palm desert scene': 2, 'jazz funk': 2, 'electric bass': 2, 'melodic thrash': 2, 'thrash metal': 2, 'speed metal': 2, 'old school thrash': 2, 'rap rock': 2, 'conscious hip hop': 2, 'political hip hop': 2, 'sacramento indie': 2, 'indie rock': 2, 'progressive jazz fusion': 2, 'dance pop': 2, 'post-rock': 2, 'math rock': 2, 'jam band': 1, 'funk': 1, 'p funk': 1, 'singer-songwriter': 1, 'supergroup': 1, 'blues rock': 1, 'instrumental djent': 1, 'jazz metal': 1})

Cluster 5 (length: 34): Counter({'progressive metal': 12, 'progressive rock': 8, 'metal': 7, 'alternative metal': 7, 'art rock': 6, 'rock': 5, 'nu metal': 4, 'post-grunge': 3, 'swedish metal': 3, 'symphonic rock': 3, 'psychedelic rock': 3, 'el paso indie': 3, 'garage rock': 3, 'progressive death metal': 2, 'swedish progressive metal': 2, 'instrumental rock': 2, 'jazz rock': 2, 'zolo': 2, 'djent': 2, 'funk rock': 1, 'jazz': 1, 'progressive jazz fusion': 1, 'contemporary jazz': 1, 'hard rock': 1, 'double drumming': 1, 'australian psych': 1, 'neo-psychedelic': 1, 'microtonal': 1, 'groove metal': 1, 'progressive groove metal': 1, 'technical groove metal': 1, 'technical thrash': 1, 'classic rock': 1, 'album rock': 1})

Cluster 6 (length: 77): Counter({'rock': 106, 'alternative rock': 86, 'permanent wave': 71, 'alternative metal': 47, 'funk metal': 39, 'funk rock': 38, 'nu metal': 36, 'art rock': 25, 'modern rock': 23, 'melancholia': 20, 'metal': 19, 'progressive metal': 19, 'oxford indie': 18, 'instrumental rock': 15, 'hard rock': 15, 'groove metal': 14, 'grunge': 13, 'classic rock': 13, 'progressive groove metal': 13, 'progressive rock': 12, 'french metal': 12, 'post-grunge': 11, 'album rock': 11, 'french death metal': 11, 'rap metal': 10, 'djent': 9, 'dance pop': 8, 'swedish metal': 7, 'double drumming': 7, 'australian psych': 7, 'neo-psychedelic': 7, 'microtonal': 7, 'progressive jazz fusion': 6, 'symphonic rock': 6, 'indie rock': 5, 'sacramento indie': 5, 'swedish progressive metal': 5, 'electric bass': 4, 'jam band': 4, 'psychedelic rock': 4, 'el paso indie': 4, 'supergroup': 4, 'stoner rock': 3, 'old school thrash': 3, 'soft rock': 3, 'new wave': 3, 'uk post-punk': 3, 'trip hop': 3, 'jazz funk': 3, 'jazz fusion': 3, 'jazz': 3, 'instrumental djent': 3, 'jazz metal': 3, 'glam rock': 3, 'stoner metal': 2, 'palm desert scene': 2, 'instrumental funk': 2, 'melodic thrash': 2, 'thrash metal': 2, 'speed metal': 2, 'technical groove metal': 2, 'technical thrash': 2, 'mellow gold': 2, 'garage rock': 2, 'contemporary jazz': 2, 'instrumental math rock': 2, 'math rock': 2, 'funk': 1, 'p funk': 1, 'blues rock': 1, 'progressive death metal': 1, 'singer-songwriter': 1, 'rap rock': 1, 'conscious hip hop': 1, 'political hip hop': 1, 'canadian metal': 1, 'classic canadian rock': 1})

Cluster 7 (length: 70): Counter({'rock': 39, 'alternative metal': 29, 'alternative rock': 28, 'permanent wave': 21, 'nu metal': 18, 'funk rock': 17, 'progressive metal': 16, 'metal': 16, 'progressive rock': 15, 'art rock': 12, 'funk metal': 11, 'grunge': 10, 'classic rock': 10, 'groove metal': 10, 'progressive groove metal': 10, 'modern rock': 9, 'djent': 9, 'hard rock': 9, 'instrumental rock': 9, 'symphonic rock': 8, 'jazz': 8, 'french death metal': 8, 'french metal': 8, 'progressive jazz fusion': 8, 'contemporary jazz': 7, 'melancholia': 6, 'album rock': 6, 'jazz fusion': 6, 'oxford indie': 5, 'stoner rock': 5, 'psychedelic rock': 5, 'jazz funk': 5, 'electric bass': 5, 'stoner metal': 4, 'palm desert scene': 4, 'dance pop': 4, 'microtonal': 4, 'australian psych': 4, 'el paso indie': 4, 'garage rock': 4, 'instrumental funk': 3, 'melodic thrash': 3, 'thrash metal': 3, 'speed metal': 3, 'old school thrash': 3, 'swedish metal': 3, 'instrumental math rock': 3, 'math rock': 3, 'blues rock': 3, 'jazz rock': 3, 'zolo': 3, 'post-grunge': 2, 'funk': 2, 'p funk': 2, 'new wave': 2, 'uk post-punk': 2, 'double drumming': 2, 'neo-psychedelic': 2, 'canadian metal': 2, 'classic canadian rock': 2, 'technical groove metal': 2, 'technical thrash': 2, 'trip hop': 1, 'progressive death metal': 1, 'swedish progressive metal': 1, 'singer-songwriter': 1, 'glam rock': 1, 'rap metal': 1, 'sacramento indie': 1, 'post-rock': 1})

Cluster 8 (length: 67): Counter({'rock': 44, 'alternative rock': 28, 'permanent wave': 27, 'alternative metal': 19, 'modern rock': 14, 'el paso indie': 11, 'nu metal': 11, 'hard rock': 10, 'art rock': 10, 'grunge': 8, 'metal': 8, 'garage rock': 8, 'progressive metal': 8, 'instrumental rock': 7, 'post-grunge': 7, 'oxford indie': 7, 'melancholia': 7, 'thrash metal': 5, 'old school thrash': 5, 'rap metal': 5, 'djent': 5, 'progressive rock': 4, 'melodic thrash': 4, 'speed metal': 4, 'funk metal': 4, 'double drumming': 4, 'australian psych': 4, 'neo-psychedelic': 4, 'microtonal': 4, 'symphonic rock': 3, 'psychedelic rock': 3, 'dance pop': 3, 'classic rock': 3, 'progressive jazz fusion': 3, 'french death metal': 3, 'groove metal': 3, 'progressive groove metal': 3, 'french metal': 3, 'instrumental djent': 2, 'jazz metal': 2, 'funk rock': 2, 'album rock': 2, 'instrumental funk': 2, 'trip hop': 1, 'jazz rock': 1, 'zolo': 1, 'rap rock': 1, 'conscious hip hop': 1, 'political hip hop': 1, 'jam band': 1, 'shoegaze': 1, 'noise pop': 1, 'stoner rock': 1, 'stoner metal': 1, 'palm desert scene': 1, 'instrumental math rock': 1, 'new wave': 1, 'uk post-punk': 1, 'singer-songwriter': 1, 'mellow gold': 1, 'blues rock': 1, 'soft rock': 1, 'electric bass': 1, 'jazz': 1, 'jazz funk': 1, 'jazz fusion': 1, 'supergroup': 1})

Cluster 9 (length: 69): Counter({'rock': 34, 'alternative metal': 30, 'nu metal': 25, 'hard rock': 19, 'metal': 19, 'alternative rock': 18, 'progressive metal': 15, 'classic rock': 11, 'progressive rock': 11, 'grunge': 10, 'funk rock': 9, 'album rock': 9, 'groove metal': 9, 'art rock': 8, 'progressive groove metal': 8, 'el paso indie': 8, 'funk metal': 7, 'old school thrash': 7, 'permanent wave': 7, 'garage rock': 7, 'thrash metal': 6, 'french death metal': 6, 'french metal': 6, 'djent': 5, 'symphonic rock': 5, 'instrumental rock': 4, 'modern rock': 4, 'psychedelic rock': 4, 'post-rock': 3, 'math rock': 3, 'rap metal': 3, 'melodic thrash': 3, 'speed metal': 3, 'mellow gold': 3, 'microtonal': 3, 'post-grunge': 3, 'sacramento indie': 2, 'jazz': 2, 'progressive jazz fusion': 2, 'contemporary jazz': 2, 'melancholia': 2, 'technical groove metal': 2, 'swedish metal': 2, 'technical thrash': 2, 'supergroup': 2, 'glam rock': 2, 'jam band': 2, 'double drumming': 2, 'australian psych': 2, 'neo-psychedelic': 2, 'soft rock': 2, 'instrumental math rock': 2, 'zolo': 2, 'new wave': 2, 'uk post-punk': 2, 'stoner rock': 1, 'oxford indie': 1, 'dance pop': 1, 'singer-songwriter': 1, 'blues rock': 1, 'noise pop': 1, 'instrumental funk': 1, 'jazz rock': 1, 'indie rock': 1, 'rap rock': 1, 'conscious hip hop': 1, 'political hip hop': 1, 'canadian metal': 1, 'classic canadian rock': 1})

In [22]:
for cluster_id, genre_list in clusters.items():
    genre_list_filtered = {k:v for k,v in genre_list.items() if v >= 5}
    print(f"Cluster {cluster_id} (length: {len(genre_list_filtered)}): {genre_list_filtered}\n")
Cluster 0 (length: 19): {'funk rock': 8, 'alternative metal': 9, 'art rock': 18, 'progressive rock': 26, 'progressive metal': 27, 'nu metal': 8, 'rock': 25, 'post-grunge': 5, 'jazz': 8, 'progressive jazz fusion': 6, 'contemporary jazz': 6, 'classic rock': 16, 'hard rock': 17, 'mellow gold': 6, 'album rock': 16, 'soft rock': 10, 'symphonic rock': 13, 'djent': 6, 'metal': 12}

Cluster 1 (length: 30): {'classic rock': 11, 'hard rock': 17, 'progressive rock': 15, 'album rock': 9, 'rock': 40, 'alternative rock': 25, 'alternative metal': 21, 'grunge': 5, 'modern rock': 10, 'funk metal': 7, 'funk rock': 6, 'nu metal': 13, 'instrumental rock': 11, 'art rock': 12, 'jazz rock': 5, 'zolo': 5, 'symphonic rock': 8, 'psychedelic rock': 7, 'djent': 11, 'progressive metal': 18, 'permanent wave': 18, 'post-grunge': 5, 'el paso indie': 5, 'groove metal': 5, 'metal': 11, 'melodic thrash': 5, 'thrash metal': 5, 'speed metal': 5, 'old school thrash': 6, 'progressive jazz fusion': 5}

Cluster 2 (length: 1): {'progressive metal': 5}

Cluster 3 (length: 17): {'classic rock': 16, 'hard rock': 11, 'art rock': 21, 'progressive rock': 28, 'album rock': 15, 'soft rock': 7, 'rock': 21, 'symphonic rock': 16, 'alternative metal': 11, 'nu metal': 6, 'progressive metal': 21, 'metal': 9, 'post-grunge': 5, 'psychedelic rock': 9, 'instrumental rock': 5, 'jazz rock': 5, 'zolo': 5}

Cluster 4 (length: 30): {'classic rock': 17, 'art rock': 10, 'progressive rock': 10, 'album rock': 17, 'symphonic rock': 7, 'alternative rock': 41, 'acid rock': 6, 'hard rock': 17, 'rock': 56, 'psychedelic rock': 11, 'progressive metal': 6, 'metal': 7, 'funk metal': 21, 'funk rock': 23, 'permanent wave': 26, 'alternative metal': 24, 'grunge': 13, 'modern rock': 8, 'french metal': 5, 'nu metal': 17, 'instrumental rock': 9, 'el paso indie': 5, 'instrumental funk': 5, 'noise pop': 5, 'neo-psychedelic': 7, 'microtonal': 7, 'double drumming': 6, 'australian psych': 6, 'rap metal': 6, 'post-grunge': 5}

Cluster 5 (length: 6): {'progressive metal': 12, 'metal': 7, 'alternative metal': 7, 'art rock': 6, 'progressive rock': 8, 'rock': 5}

Cluster 6 (length: 37): {'alternative rock': 86, 'rock': 106, 'indie rock': 5, 'modern rock': 23, 'instrumental rock': 15, 'progressive jazz fusion': 6, 'dance pop': 8, 'alternative metal': 47, 'grunge': 13, 'oxford indie': 18, 'art rock': 25, 'permanent wave': 71, 'melancholia': 20, 'funk metal': 39, 'funk rock': 38, 'post-grunge': 11, 'rap metal': 10, 'sacramento indie': 5, 'nu metal': 36, 'hard rock': 15, 'metal': 19, 'classic rock': 13, 'album rock': 11, 'djent': 9, 'groove metal': 14, 'progressive groove metal': 13, 'swedish metal': 7, 'progressive metal': 19, 'swedish progressive metal': 5, 'progressive rock': 12, 'symphonic rock': 6, 'french death metal': 11, 'french metal': 12, 'double drumming': 7, 'australian psych': 7, 'neo-psychedelic': 7, 'microtonal': 7}

Cluster 7 (length: 33): {'alternative rock': 28, 'oxford indie': 5, 'art rock': 12, 'permanent wave': 21, 'rock': 39, 'melancholia': 6, 'alternative metal': 29, 'stoner rock': 5, 'grunge': 10, 'modern rock': 9, 'classic rock': 10, 'progressive rock': 15, 'album rock': 6, 'symphonic rock': 8, 'psychedelic rock': 5, 'progressive metal': 16, 'metal': 16, 'jazz funk': 5, 'electric bass': 5, 'jazz': 8, 'jazz fusion': 6, 'funk metal': 11, 'funk rock': 17, 'french death metal': 8, 'groove metal': 10, 'progressive groove metal': 10, 'french metal': 8, 'nu metal': 18, 'djent': 9, 'progressive jazz fusion': 8, 'contemporary jazz': 7, 'hard rock': 9, 'instrumental rock': 9}

Cluster 8 (length: 21): {'permanent wave': 27, 'grunge': 8, 'rock': 44, 'alternative rock': 28, 'hard rock': 10, 'thrash metal': 5, 'old school thrash': 5, 'metal': 8, 'el paso indie': 11, 'garage rock': 8, 'progressive metal': 8, 'alternative metal': 19, 'rap metal': 5, 'nu metal': 11, 'instrumental rock': 7, 'art rock': 10, 'djent': 5, 'post-grunge': 7, 'modern rock': 14, 'oxford indie': 7, 'melancholia': 7}

Cluster 9 (length: 25): {'alternative metal': 30, 'rock': 34, 'nu metal': 25, 'alternative rock': 18, 'funk metal': 7, 'grunge': 10, 'funk rock': 9, 'classic rock': 11, 'hard rock': 19, 'album rock': 9, 'metal': 19, 'thrash metal': 6, 'old school thrash': 7, 'permanent wave': 7, 'art rock': 8, 'progressive metal': 15, 'djent': 5, 'groove metal': 9, 'progressive groove metal': 8, 'el paso indie': 8, 'garage rock': 7, 'french death metal': 6, 'french metal': 6, 'progressive rock': 11, 'symphonic rock': 5}

As most tracks have several genres, this poses a problem grouping, as when training the model, there are several rows with the same input features, but a different "target" (as it's a non-supervised ML model, there is no real target in the model, but the genres that a track pertains are known, and we use this information to assess the clustering), there are genres that can be in several clusters. Although this could help to generalize the model, with a dataset of this size (with unequal distribution of genres also), it's more problematic.

Clusters are not explained in detail, but it can be seen that some of them are very huge (cluster 0 contains around 75% of the genres in the dataset), and they are all very similar in general (they all contain mixes of rock and metal). It does not seem a good clustering, as there is no clear distinction between genres. A filtered list is made, with only the genres in each cluster that appear more than 5 times, to exclude those with low representation. This filtering can help, but the results show that the clusters are still very mixed, without a clear sepparation.

In [23]:
for i in range(0, 15):
    _df = track_info_labeled.loc[track_info_labeled["cluster"] == i, ["track_name", "artist", "track_url"]]
    elements = min(5, _df.shape[0])
    print(f"Cluster {i}")
    print(_df.sample(elements).values)
Cluster 0
[['Natural Science' 'Rush'
  'https://open.spotify.com/track/3sFgH9hqMB6cySLWJOGAJ4']
 ['Cockroach King' 'Haken'
  'https://open.spotify.com/track/0J1CIpL8IuUf0OyijwkMFj']
 ['Dawn' 'Gojira' 'https://open.spotify.com/track/4AaSZHSg7XzYIxekKHu0uT']
 ['Opiate²' 'TOOL'
  'https://open.spotify.com/track/6iQDmWrbrMQ0vPfbKqqvKU']
 ['Wars of Armageddon' 'Funkadelic'
  'https://open.spotify.com/track/0m3C4IR5NsZnRPafCjvTFZ']]
Cluster 1
[['Voodoo Child (Slight Return)' 'Jimi Hendrix'
  'https://open.spotify.com/track/2AxCeJ6PSsBYiTckM0HLY7']
 ['The Noose' 'A Perfect Circle'
  'https://open.spotify.com/track/6lvNLD1XRU5paMwWH0RGRI']
 ['Finger' 'Elephant Gym'
  'https://open.spotify.com/track/2WsaBLIudbpWGYvEBMhvFi']
 ['Megalomania' 'Muse'
  'https://open.spotify.com/track/2S9tY6X04CTb9ZAA2PCpC2']
 ['Sleepless' 'King Crimson'
  'https://open.spotify.com/track/0JwoMuwai5esGaMOkEAXCF']]
Cluster 2
[['Visions - remastered 2017' 'Haken'
  'https://open.spotify.com/track/4oBKsjV5vzQ4pIlkp6Jfk4']
 ['Tubular Bells - Pt. I' 'Mike Oldfield'
  'https://open.spotify.com/track/7ERSQrRptZVM7q3VOdM7OL']
 ['Octavarium' 'Dream Theater'
  'https://open.spotify.com/track/4TZo49HN2MkbWmHMTf4NcH']
 ['I (21 Minute Track)' 'Meshuggah'
  'https://open.spotify.com/track/6mnLcQF5GYlFycGu2htkvb']
 ['Venus & Mars' 'Jack The Joker'
  'https://open.spotify.com/track/3VnrO2i3uFCkXE6n5ROvkR']]
Cluster 3
[['Lingus' 'Snarky Puppy'
  'https://open.spotify.com/track/68d6ZfyMUYURol2y15Ta2Y']
 ['The River' 'King Gizzard & The Lizard Wizard'
  'https://open.spotify.com/track/7p3V6QKXakAO8go0LhHFNP']
 ['America - 2003 Remaster' 'Yes'
  'https://open.spotify.com/track/5jleoXgOf41HdLITE0Omu3']
 ['Beneath My Skin / Mirror Image' 'TesseracT'
  'https://open.spotify.com/track/0pfuQHU0YfhmaHJ99W9lDb']
 ['Soul Sacrifice - Live at The Woodstock Music & Art Fair, August 16, 1969'
  'Santana' 'https://open.spotify.com/track/7zAoVpCLJFsyRfCbGUIAFf']]
Cluster 4
[['If You Have to Ask' 'Red Hot Chili Peppers'
  'https://open.spotify.com/track/2R6go62CuxqqX0w1TgXxes']
 ['Epiphany - Concealing Fate, Pt. 5' 'TesseracT'
  'https://open.spotify.com/track/42A9pLHYVYum5lbq4lGPif']
 ['E-Pro' 'Beck' 'https://open.spotify.com/track/01MBhRpvFkbeRwAp7gcF2W']
 ['Prague' 'Muse' 'https://open.spotify.com/track/1PQ6B0C26xAWv9Z5u2aKT0']
 ['The Gallery' 'Muse'
  'https://open.spotify.com/track/2b7lgdr7kkAnsyAJ1Tdp2V']]
Cluster 5
[['Rime of the Ancient Mariner - 2015 Remaster' 'Iron Maiden'
  'https://open.spotify.com/track/6K8ROjiPJqyHDJS0sA0dwH']
 ['7empest' 'TOOL'
  'https://open.spotify.com/track/0gGfmw4csswZmFPj9YK8GW']
 ['The Curtain - Live From Dordrecht, Het Energiehuis / 2014'
  'Snarky Puppy' 'https://open.spotify.com/track/29ls98FgNdZHbmqdQeF7E6']
 ["Larks' Tongue in Aspic, Pt. IV (incl. Coda: I Have A Dream)"
  'King Crimson' 'https://open.spotify.com/track/1altDGwFtoHP7QxzivXlSI']
 ['The Dripping Tap' 'King Gizzard & The Lizard Wizard'
  'https://open.spotify.com/track/0o6rOggbaLEvtwUHNztuD2']]
Cluster 6
[['Minerva' 'Deftones'
  'https://open.spotify.com/track/1gzWd0ILFaCoHUfQSkCIvl']
 ['All Secrets Known' 'Alice In Chains'
  'https://open.spotify.com/track/0vQDhuk73PvmaloRibUiQr']
 ['glimmer' 'Covet'
  'https://open.spotify.com/track/4kEhypsV7I5JAwHAesTKqR']
 ["L'enfant sauvage" 'Gojira'
  'https://open.spotify.com/track/5lOUVddyItbbzMTB1PqISs']
 ['Cassandra Gemini' 'The Mars Volta'
  'https://open.spotify.com/track/7niwGnlrDXwKtNteHbix2i']]
Cluster 7
[['Ilyena' 'The Mars Volta'
  'https://open.spotify.com/track/3MCXQZBWRPsSqzXxQbspT4']
 ['Selenium Forest' 'Plini'
  'https://open.spotify.com/track/18pIsa1XH5Eap4SBcSH4Xd']
 ['Ashes In Your Mouth - 2004 Remastered' 'Megadeth'
  'https://open.spotify.com/track/2Fkgr7LZTAjakkceB7PwN5']
 ['Fascination Street - Remastered' 'The Cure'
  'https://open.spotify.com/track/3lDAJbYBmCoWbQx93JTrea']
 ['Stadium Arcadium' 'Red Hot Chili Peppers'
  'https://open.spotify.com/track/4y84ILALZSa4LyP6H7NVjR']]
Cluster 8
[['Disco Ulysses (Instrumental)' 'Vulfpeck'
  'https://open.spotify.com/track/608uL3XBV8f2MHKNgm32Y3']
 ['Across The Universe - Remastered 2009' 'The Beatles'
  'https://open.spotify.com/track/4dkoqJrP0L8FXftrMZongF']
 ['Wake Up Dead - 2004 Remaster' 'Megadeth'
  'https://open.spotify.com/track/1I3qfFMraXE0kAPtRERpok']
 ['Plug in Baby' 'Muse'
  'https://open.spotify.com/track/2UKARCqDrhkYDoVR4FN5Wi']
 ['Perfection - Concealing Fate, Pt. 4' 'TesseracT'
  'https://open.spotify.com/track/1s2Ky2iOgSPXYsZ1RtyBtd']]
Cluster 9
[['The Trek' 'Primus'
  'https://open.spotify.com/track/15m0MEyKTpuwwdEBGAghyL']
 ['In My Darkest Hour - Remastered 2004' 'Megadeth'
  'https://open.spotify.com/track/5LO0sJCkNMZYLYeGOvblLu']
 ['Constant Motion' 'Dream Theater'
  'https://open.spotify.com/track/1ElUz8eHPMZitrPeAMiUng']
 ['Paranoid Android' 'Radiohead'
  'https://open.spotify.com/track/6LgJvl0Xdtc73RJ1mmpotq']
 ['Never Enough' 'Dream Theater'
  'https://open.spotify.com/track/2q7lLGG4xhUlluVKDHas2D']]
Cluster 10
[]
Cluster 11
[]
Cluster 12
[]
Cluster 13
[]
Cluster 14
[]

Although some clusters seem ok (the songs seem to share some traits), clusters seem very mixed, with a little bit of everything in each one, so it's not a good grouping, as one could easily assign a random cluster to a track and it has chances to align with the ML clustering.

Finally, a GMM model is fitted to the data to see if its results are better than KMeans.

The GMM model gives a probability for a data point to pertain to a certain cluster. We save the output with the highest probability, as well as all results that have a probability higher than a certain threshold (in those cases, the data point can be part of more than one cluster). We compare the results.

In [24]:
lower_limit = 0.5
gmm_components = 10

model = GaussianMixture(covariance_type="tied", n_components=gmm_components, random_state=random_state)
model.fit(X_scaled)
results = model.predict(X_scaled)
probabilities = model.predict_proba(X_scaled)

track_clustered = track_info.copy()
track_clustered["cluster_single"] = results
track_clustered["clusters"] = probabilities.tolist()
track_clustered["clusters"] = track_clustered["clusters"].apply(lambda x: [i for i, e in enumerate(x) if e > lower_limit])

track_clustered.head(5)
Out[24]:
num_sections danceability track_name sections_avg_duration instrumentalness liveness track_url loudness duration speechiness ... artist_id key_changes key tempo genres explicit case cluster_NLP cluster_single clusters
0 7.0 0.677 No You Girls 31.970 0.000077 0.0967 https://open.spotify.com/track/4VP8QiCeaZq8BeT... -4.102 223.79 0.0296 ... 0XNa1vTidXlvJ2gHSsRi4A 0.000000 D 104.780 [alternative rock, rock, indie rock, modern rock] False user [3, 4] 9 [9]
1 14.0 0.449 Cliffs Of Dover - Instrumental 17.841 0.149000 0.2480 https://open.spotify.com/track/5qm0KiVKMXW1kq6... -12.029 249.77 0.0405 ... 4CxobvwTpmfpIEbkYh4pAb 0.008007 G 94.907 [instrumental rock, progressive jazz fusion] False user [2, 3] 7 [7]
2 9.0 0.526 Genesis Ch.1. V.32 23.241 0.984000 0.1100 https://open.spotify.com/track/2Pmkm67wkf5ucIO... -9.747 209.17 0.0372 ... 2m62cc253Xvd9qYQ8d2X3d 0.000000 D 126.610 [classic rock, art rock, progressive rock, mel... False user [0, 3, 4] 9 [9]
3 13.0 0.745 Stop Don't Panic 20.936 0.588000 0.1880 https://open.spotify.com/track/38rSSEjYzSngmL0... -5.549 272.17 0.0502 ... 6J7biCazzYhU3gM9j1wfid 0.000000 C# 112.450 [dance pop] False user [1] 3 [3]
4 27.0 0.270 America - 2003 Remaster 23.392 0.283000 0.3010 https://open.spotify.com/track/5jleoXgOf41HdLI... -8.755 631.57 0.0630 ... 7AC976RDJzL2asmZuz7qil 0.004750 D 176.030 [classic rock, hard rock, art rock, progressiv... False user [3, 4] 9 [9]

5 rows × 33 columns

In [25]:
genre_info_clustered = track_clustered.explode("genres")
genre_info_clustered["genres"] = genre_info_clustered["genres"].fillna("No Genre")

cluster_items = genre_info_clustered[["cluster_single", "genres"]].groupby("cluster_single").agg(list).reset_index()
cluster_items["genres"] = cluster_items["genres"].apply(lambda x: list(set(x)))

for _, row in cluster_items.iterrows():
    print(f"Cluster {row['cluster_single']} (length: {len(row['genres'])}): {', '.join(row['genres'])}\n")
Cluster 0 (length: 67): electric bass, french death metal, metal, acid rock, oxford indie, art rock, progressive metal, grunge, funk rock, microtonal, nu metal, uk post-punk, jazz fusion, jam band, classic rock, groove metal, progressive groove metal, instrumental rock, conscious hip hop, rap metal, progressive jazz fusion, speed metal, album rock, modern rock, swedish metal, political hip hop, neo-psychedelic, garage rock, melodic thrash, hard rock, thrash metal, supergroup, progressive death metal, rap rock, dance pop, math rock, funk metal, jazz funk, double drumming, sacramento indie, french metal, indie rock, trip hop, permanent wave, post-grunge, swedish progressive metal, classic canadian rock, psychedelic rock, alternative rock, djent, alternative metal, stoner rock, el paso indie, new wave, singer-songwriter, canadian metal, jazz, australian psych, progressive rock, contemporary jazz, jazz rock, old school thrash, soft rock, zolo, rock, symphonic rock, melancholia

Cluster 1 (length: 55): zolo, french death metal, metal, instrumental djent, palm desert scene, oxford indie, art rock, jazz metal, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, jazz fusion, classic rock, groove metal, stoner metal, progressive groove metal, instrumental rock, rap metal, progressive jazz fusion, neo-psychedelic, modern rock, album rock, post-rock, swedish metal, garage rock, hard rock, supergroup, dance pop, math rock, funk metal, double drumming, french metal, trip hop, permanent wave, swedish progressive metal, psychedelic rock, djent, alternative rock, alternative metal, el paso indie, stoner rock, instrumental math rock, australian psych, progressive rock, mellow gold, technical groove metal, jazz rock, soft rock, technical thrash, rock, symphonic rock, melancholia

Cluster 2 (length: 68): electric bass, french death metal, palm desert scene, oxford indie, acid rock, art rock, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, uk post-punk, jazz fusion, jam band, classic rock, groove metal, stoner metal, progressive groove metal, noise pop, instrumental rock, rap metal, progressive jazz fusion, glam rock, swedish metal, modern rock, album rock, funk, neo-psychedelic, post-rock, garage rock, symphonic rock, hard rock, dance pop, math rock, funk metal, jazz funk, sacramento indie, double drumming, french metal, indie rock, permanent wave, shoegaze, p funk, post-grunge, classic canadian rock, djent, alternative rock, instrumental funk, alternative metal, stoner rock, new wave, el paso indie, psychedelic rock, singer-songwriter, canadian metal, jazz, instrumental math rock, australian psych, progressive rock, mellow gold, contemporary jazz, technical groove metal, soft rock, technical thrash, rock, metal, melancholia

Cluster 3 (length: 63): electric bass, french death metal, metal, instrumental djent, palm desert scene, oxford indie, art rock, jazz metal, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, jazz fusion, technical thrash, classic rock, groove metal, stoner metal, progressive groove metal, instrumental rock, rap metal, progressive jazz fusion, post-rock, modern rock, album rock, funk, swedish metal, neo-psychedelic, hard rock, supergroup, dance pop, funk metal, jazz funk, sacramento indie, double drumming, french metal, trip hop, permanent wave, p funk, post-grunge, psychedelic rock, djent, alternative rock, instrumental funk, alternative metal, stoner rock, classic canadian rock, singer-songwriter, canadian metal, jazz, instrumental math rock, australian psych, progressive rock, mellow gold, technical groove metal, jazz rock, contemporary jazz, soft rock, zolo, rock, symphonic rock, melancholia

Cluster 4 (length: 65): electric bass, palm desert scene, oxford indie, art rock, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, jazz fusion, classic rock, groove metal, stoner metal, noise pop, conscious hip hop, instrumental rock, rap metal, progressive jazz fusion, glam rock, speed metal, neo-psychedelic, modern rock, album rock, funk, political hip hop, garage rock, symphonic rock, melodic thrash, hard rock, thrash metal, rap rock, supergroup, dance pop, math rock, funk metal, jazz funk, double drumming, sacramento indie, trip hop, permanent wave, shoegaze, p funk, post-grunge, psychedelic rock, djent, alternative rock, instrumental funk, alternative metal, stoner rock, el paso indie, classic canadian rock, singer-songwriter, canadian metal, jazz, instrumental math rock, australian psych, progressive rock, mellow gold, contemporary jazz, old school thrash, soft rock, rock, metal, melancholia

Cluster 5 (length: 64): french death metal, instrumental djent, acid rock, oxford indie, art rock, jazz metal, progressive metal, grunge, funk rock, microtonal, nu metal, jam band, classic rock, groove metal, progressive groove metal, noise pop, instrumental rock, conscious hip hop, rap metal, progressive jazz fusion, speed metal, album rock, modern rock, swedish metal, political hip hop, post-rock, funk, neo-psychedelic, garage rock, symphonic rock, melodic thrash, hard rock, thrash metal, supergroup, progressive death metal, rap rock, math rock, funk metal, double drumming, indie rock, french metal, permanent wave, p funk, swedish progressive metal, post-grunge, classic canadian rock, psychedelic rock, alternative rock, djent, alternative metal, stoner rock, el paso indie, canadian metal, instrumental math rock, australian psych, progressive rock, mellow gold, old school thrash, jazz rock, soft rock, zolo, rock, metal, melancholia

Cluster 6 (length: 67): electric bass, french death metal, acid rock, oxford indie, art rock, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, uk post-punk, jam band, jazz fusion, classic rock, groove metal, progressive groove metal, instrumental rock, glam rock, progressive jazz fusion, rap metal, speed metal, album rock, modern rock, swedish metal, neo-psychedelic, garage rock, symphonic rock, melodic thrash, hard rock, thrash metal, supergroup, progressive death metal, dance pop, math rock, funk metal, jazz funk, sacramento indie, double drumming, french metal, indie rock, permanent wave, shoegaze, post-grunge, swedish progressive metal, psychedelic rock, djent, alternative rock, new wave, alternative metal, el paso indie, classic canadian rock, instrumental funk, singer-songwriter, canadian metal, jazz, australian psych, progressive rock, mellow gold, contemporary jazz, old school thrash, technical groove metal, soft rock, technical thrash, rock, metal, melancholia

Cluster 7 (length: 69): electric bass, french death metal, instrumental djent, palm desert scene, oxford indie, art rock, jazz metal, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, jazz fusion, jam band, classic rock, groove metal, stoner metal, progressive groove metal, instrumental rock, conscious hip hop, rap metal, progressive jazz fusion, speed metal, neo-psychedelic, modern rock, swedish metal, political hip hop, album rock, funk, garage rock, symphonic rock, melodic thrash, hard rock, thrash metal, rap rock, progressive death metal, supergroup, dance pop, funk metal, jazz funk, double drumming, sacramento indie, french metal, indie rock, permanent wave, p funk, post-grunge, swedish progressive metal, psychedelic rock, classic canadian rock, alternative rock, djent, alternative metal, stoner rock, el paso indie, instrumental funk, canadian metal, jazz, australian psych, progressive rock, contemporary jazz, old school thrash, jazz rock, soft rock, zolo, rock, metal, melancholia

Cluster 8 (length: 2): el paso indie, garage rock

Cluster 9 (length: 74): electric bass, french death metal, metal, instrumental djent, acid rock, palm desert scene, oxford indie, art rock, jazz metal, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, uk post-punk, jazz fusion, jam band, classic rock, groove metal, stoner metal, progressive groove metal, noise pop, instrumental rock, rap metal, progressive jazz fusion, glam rock, speed metal, album rock, modern rock, swedish metal, neo-psychedelic, post-rock, funk, garage rock, melodic thrash, hard rock, thrash metal, progressive death metal, dance pop, math rock, funk metal, jazz funk, sacramento indie, double drumming, indie rock, french metal, trip hop, permanent wave, shoegaze, p funk, swedish progressive metal, post-grunge, psychedelic rock, djent, alternative rock, instrumental funk, alternative metal, el paso indie, stoner rock, new wave, jazz, instrumental math rock, australian psych, progressive rock, mellow gold, contemporary jazz, jazz rock, old school thrash, soft rock, zolo, rock, symphonic rock, melancholia

In [26]:
for i in range(0, gmm_components):
    _df = track_clustered.loc[track_clustered["cluster_single"] == i, ["track_name", "artist", "track_url"]]
    elements = min(5, _df.shape[0])
    print(f"Cluster {i}")
    print(_df.sample(elements).values)
Cluster 0
[['One Big Holiday' 'My Morning Jacket'
  'https://open.spotify.com/track/4hcNGsiVC4bJRrucc6BAU1']
 ["Space Truckin' - Remastered 2012" 'Deep Purple'
  'https://open.spotify.com/track/5S126DaCBZ8z6yh7B1Lszr']
 ['(Nice Dream)' 'Radiohead'
  'https://open.spotify.com/track/1tZcw7GtIqviL32bzaKdSo']
 ['Moonchild' 'King Crimson'
  'https://open.spotify.com/track/0NNKkdcablz5mAdnzz8U40']
 ['A Letter To Elise' 'The Cure'
  'https://open.spotify.com/track/7mEGddVRDdESAibWOnbXoA']]
Cluster 1
[['K.G.L.W.' 'King Gizzard & The Lizard Wizard'
  'https://open.spotify.com/track/4QugdF1lazluZvplbfVjCR']
 ['Gordian Naught' 'Animals As Leaders'
  'https://open.spotify.com/track/7uhwNvGV8LaWoHsrawt6jD']
 ['Astroturf' 'King Gizzard & The Lizard Wizard'
  'https://open.spotify.com/track/5LjGOJDlKw1XyX9s0P8eFh']
 ['Hypnotize' 'System Of A Down'
  'https://open.spotify.com/track/16gpk3oHK8Ela7QbRNGjJd']
 ['Sgt. Baker' 'Primus'
  'https://open.spotify.com/track/6Kx7HX8d1EBe927L7V6vL4']]
Cluster 2
[['Talk Show Host' 'Radiohead'
  'https://open.spotify.com/track/3cMuGOGSaTWbwOurTS4b3Y']
 ['Lacquer Head' 'Primus'
  'https://open.spotify.com/track/6GAasbp33LAShmzwcaaFKC']
 ['125th Street Congress' 'Weather Report'
  'https://open.spotify.com/track/7qJkuqvDhyD20D1JCK2Aqy']
 ['Landmines' 'Rishloo'
  'https://open.spotify.com/track/0nMKVMCXCmRplsKKLt1TTh']
 ['Minerva' 'Deftones'
  'https://open.spotify.com/track/1gzWd0ILFaCoHUfQSkCIvl']]
Cluster 3
[['26 Ghosts III' 'Nine Inch Nails'
  'https://open.spotify.com/track/3lJjmUk53f2TgjArycRQyg']
 ['Mr. Brightside' 'The Killers'
  'https://open.spotify.com/track/3n3Ppam7vgaVa1iaRUc9Lp']
 ['Scentless Apprentice' 'Nirvana'
  'https://open.spotify.com/track/54UFDHWI2q7WHfrGbSNWph']
 ['Psychosphere' 'Periphery'
  'https://open.spotify.com/track/5Lvy7YsyBbDBYbjCnfZ2SQ']
 ['Conspiranoia' 'Primus'
  'https://open.spotify.com/track/0Cu06uJUB4AjdmZNdvLrpi']]
Cluster 4
[['G.O.A.T.' 'Polyphia'
  'https://open.spotify.com/track/0maCwhZTO3PybhSiQcsjAf']
 ['Snow (Hey Oh)' 'Red Hot Chili Peppers'
  'https://open.spotify.com/track/2aibwv5hGXSgw7Yru8IYTO']
 ['No Quarter - 1990 Remaster' 'Led Zeppelin'
  'https://open.spotify.com/track/2fQ2iALVbAZ7MkH6PaaIJ6']
 ['Kill Or Be Killed' 'Muse'
  'https://open.spotify.com/track/4E6pemZ3WutASrphiRINbd']
 ['Microphone Fiend' 'Rage Against The Machine'
  'https://open.spotify.com/track/1gGcKk7W1priUoTwotuoqT']]
Cluster 5
[['Wake Up Dead - 2004 Remaster' 'Megadeth'
  'https://open.spotify.com/track/1I3qfFMraXE0kAPtRERpok']
 ['The Conjuring - Remastered' 'Megadeth'
  'https://open.spotify.com/track/0pv49erP5wxMZMnprRCqXT']
 ['Never Walk Alone..A Call to Arms - 2019 - Remaster' 'Megadeth'
  'https://open.spotify.com/track/5IwzKBvIVsXKmAw91PAr0R']
 ['Cassandra Gemini: Plant A Nail In the Navel Stream' 'The Mars Volta'
  'https://open.spotify.com/track/6dFLb8SpYYlg4NxyJW7535']
 ['Cygnus...Vismund Cygnus' 'The Mars Volta'
  'https://open.spotify.com/track/2mmygsZnoEzJHXzEMgLd76']]
Cluster 6
[['I (21 Minute Track)' 'Meshuggah'
  'https://open.spotify.com/track/6mnLcQF5GYlFycGu2htkvb']
 ['Bag of Grins' 'Red Hot Chili Peppers'
  'https://open.spotify.com/track/1TmBaKkDUu0akM9xzSxRia']
 ['Hysteria - Live from Wembley Stadium' 'Muse'
  'https://open.spotify.com/track/0dLQn1q9rYqMQgH27jMOHf']
 ['Child Of Vision - 2010 Remastered' 'Supertramp'
  'https://open.spotify.com/track/70OA31UNhLkTL0M8CXMwJi']
 ['New Born - Live from Wembley Stadium' 'Muse'
  'https://open.spotify.com/track/7wYASfveKIDKlIWSsT2tCO']]
Cluster 7
[['Right In Two' 'TOOL'
  'https://open.spotify.com/track/0NLDZzVke3Qu7vDhWyGzRk']
 ['Turn It Again' 'Red Hot Chili Peppers'
  'https://open.spotify.com/track/4gJgHqy4BVCIEcGvx0hGLw']
 ['Error' 'Deftones'
  'https://open.spotify.com/track/5XQPlP8yHnXz5qTjIZ10gC']
 ['Jurassic | Cretaceous' 'The Ocean'
  'https://open.spotify.com/track/0ontughxaUxe6EQD7d9gFf']
 ['Reflection' 'TOOL'
  'https://open.spotify.com/track/0R7HFX1LW3E0ZR5BnAJLHz']]
Cluster 8
[['Zed and Two Naughts' 'The Mars Volta'
  'https://open.spotify.com/track/38CItr5N2JM2XFLLyWUVw0']
 ['Dyslexicon' 'The Mars Volta'
  'https://open.spotify.com/track/07DiJLA3Qef8OT23D4qlWC']]
Cluster 9
[['Blacklight Shine' 'The Mars Volta'
  'https://open.spotify.com/track/2DaRrPQ9ZJYlRH1ZogVPCk']
 ['Dancing With The Moonlit Knight - Remastered 2008' 'Genesis'
  'https://open.spotify.com/track/75n6R38rfp87ElycXr7OJq']
 ['Showbiz' 'Muse'
  'https://open.spotify.com/track/2sCFFlnYg6Lk75GCcfSXEz']
 ['Standing On The Verge Of Getting It On' 'Funkadelic'
  'https://open.spotify.com/track/3JMtGlh9gg0UCN7E2MLZUj']
 ['I Never Came' 'Queens of the Stone Age'
  'https://open.spotify.com/track/52mTWAP8mfQ37QPfwxmcAt']]
In [27]:
genre_info_clustered = track_clustered.explode("genres")
genre_info_clustered["genres"] = genre_info_clustered["genres"].fillna("No Genre")
genre_info_clustered = genre_info_clustered.explode("clusters")

cluster_items = genre_info_clustered[["clusters", "genres"]].groupby("clusters").agg(list).reset_index()
cluster_items["genres"] = cluster_items["genres"].apply(lambda x: list(set(x)))

for _, row in cluster_items.iterrows():
    print(f"Cluster {row['clusters']} (length: {len(row['genres'])}): {', '.join(row['genres'])}\n")
Cluster 0 (length: 67): electric bass, french death metal, metal, acid rock, oxford indie, art rock, progressive metal, grunge, funk rock, microtonal, nu metal, uk post-punk, jazz fusion, jam band, classic rock, groove metal, progressive groove metal, instrumental rock, conscious hip hop, rap metal, progressive jazz fusion, speed metal, album rock, modern rock, swedish metal, political hip hop, neo-psychedelic, garage rock, melodic thrash, hard rock, thrash metal, supergroup, progressive death metal, rap rock, dance pop, math rock, funk metal, jazz funk, double drumming, sacramento indie, french metal, indie rock, trip hop, permanent wave, post-grunge, swedish progressive metal, classic canadian rock, psychedelic rock, alternative rock, djent, alternative metal, stoner rock, el paso indie, new wave, singer-songwriter, canadian metal, jazz, australian psych, progressive rock, contemporary jazz, jazz rock, old school thrash, soft rock, zolo, rock, symphonic rock, melancholia

Cluster 1 (length: 55): zolo, french death metal, metal, instrumental djent, palm desert scene, oxford indie, art rock, jazz metal, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, jazz fusion, classic rock, groove metal, stoner metal, progressive groove metal, instrumental rock, rap metal, progressive jazz fusion, neo-psychedelic, modern rock, album rock, post-rock, swedish metal, garage rock, hard rock, supergroup, dance pop, math rock, funk metal, double drumming, french metal, trip hop, permanent wave, swedish progressive metal, psychedelic rock, djent, alternative rock, alternative metal, el paso indie, stoner rock, instrumental math rock, australian psych, progressive rock, mellow gold, technical groove metal, jazz rock, soft rock, technical thrash, rock, symphonic rock, melancholia

Cluster 2 (length: 68): electric bass, french death metal, palm desert scene, oxford indie, acid rock, art rock, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, uk post-punk, jazz fusion, jam band, classic rock, groove metal, stoner metal, progressive groove metal, noise pop, instrumental rock, rap metal, progressive jazz fusion, glam rock, swedish metal, modern rock, album rock, funk, neo-psychedelic, post-rock, garage rock, symphonic rock, hard rock, dance pop, math rock, funk metal, jazz funk, sacramento indie, double drumming, french metal, indie rock, permanent wave, shoegaze, p funk, post-grunge, classic canadian rock, djent, alternative rock, instrumental funk, alternative metal, stoner rock, new wave, el paso indie, psychedelic rock, singer-songwriter, canadian metal, jazz, instrumental math rock, australian psych, progressive rock, mellow gold, contemporary jazz, technical groove metal, soft rock, technical thrash, rock, metal, melancholia

Cluster 3 (length: 63): electric bass, french death metal, metal, instrumental djent, palm desert scene, oxford indie, art rock, jazz metal, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, jazz fusion, technical thrash, classic rock, groove metal, stoner metal, progressive groove metal, instrumental rock, rap metal, progressive jazz fusion, post-rock, modern rock, album rock, funk, swedish metal, neo-psychedelic, hard rock, supergroup, dance pop, funk metal, jazz funk, sacramento indie, double drumming, french metal, trip hop, permanent wave, p funk, post-grunge, psychedelic rock, djent, alternative rock, instrumental funk, alternative metal, stoner rock, classic canadian rock, singer-songwriter, canadian metal, jazz, instrumental math rock, australian psych, progressive rock, mellow gold, technical groove metal, jazz rock, contemporary jazz, soft rock, zolo, rock, symphonic rock, melancholia

Cluster 4 (length: 65): electric bass, palm desert scene, oxford indie, art rock, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, jazz fusion, classic rock, groove metal, stoner metal, noise pop, conscious hip hop, instrumental rock, rap metal, progressive jazz fusion, glam rock, speed metal, neo-psychedelic, modern rock, album rock, funk, political hip hop, garage rock, symphonic rock, melodic thrash, hard rock, thrash metal, rap rock, supergroup, dance pop, math rock, funk metal, jazz funk, double drumming, sacramento indie, trip hop, permanent wave, shoegaze, p funk, post-grunge, psychedelic rock, djent, alternative rock, instrumental funk, alternative metal, stoner rock, el paso indie, classic canadian rock, singer-songwriter, canadian metal, jazz, instrumental math rock, australian psych, progressive rock, mellow gold, contemporary jazz, old school thrash, soft rock, rock, metal, melancholia

Cluster 5 (length: 64): french death metal, instrumental djent, acid rock, oxford indie, art rock, jazz metal, progressive metal, grunge, funk rock, microtonal, nu metal, jam band, classic rock, groove metal, progressive groove metal, noise pop, instrumental rock, conscious hip hop, rap metal, progressive jazz fusion, speed metal, album rock, modern rock, swedish metal, political hip hop, post-rock, funk, neo-psychedelic, garage rock, symphonic rock, melodic thrash, hard rock, thrash metal, supergroup, progressive death metal, rap rock, math rock, funk metal, double drumming, indie rock, french metal, permanent wave, p funk, swedish progressive metal, post-grunge, classic canadian rock, psychedelic rock, alternative rock, djent, alternative metal, stoner rock, el paso indie, canadian metal, instrumental math rock, australian psych, progressive rock, mellow gold, old school thrash, jazz rock, soft rock, zolo, rock, metal, melancholia

Cluster 6 (length: 67): electric bass, french death metal, acid rock, oxford indie, art rock, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, uk post-punk, jam band, jazz fusion, classic rock, groove metal, progressive groove metal, instrumental rock, glam rock, progressive jazz fusion, rap metal, speed metal, album rock, modern rock, swedish metal, neo-psychedelic, garage rock, symphonic rock, melodic thrash, hard rock, thrash metal, supergroup, progressive death metal, dance pop, math rock, funk metal, jazz funk, sacramento indie, double drumming, french metal, indie rock, permanent wave, shoegaze, post-grunge, swedish progressive metal, psychedelic rock, djent, alternative rock, new wave, alternative metal, el paso indie, classic canadian rock, instrumental funk, singer-songwriter, canadian metal, jazz, australian psych, progressive rock, mellow gold, contemporary jazz, old school thrash, technical groove metal, soft rock, technical thrash, rock, metal, melancholia

Cluster 7 (length: 69): electric bass, french death metal, instrumental djent, palm desert scene, oxford indie, art rock, jazz metal, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, jazz fusion, jam band, classic rock, groove metal, stoner metal, progressive groove metal, instrumental rock, conscious hip hop, rap metal, progressive jazz fusion, speed metal, neo-psychedelic, modern rock, swedish metal, political hip hop, album rock, funk, garage rock, symphonic rock, melodic thrash, hard rock, thrash metal, rap rock, progressive death metal, supergroup, dance pop, funk metal, jazz funk, double drumming, sacramento indie, french metal, indie rock, permanent wave, p funk, post-grunge, swedish progressive metal, psychedelic rock, classic canadian rock, alternative rock, djent, alternative metal, stoner rock, el paso indie, instrumental funk, canadian metal, jazz, australian psych, progressive rock, contemporary jazz, old school thrash, jazz rock, soft rock, zolo, rock, metal, melancholia

Cluster 8 (length: 2): el paso indie, garage rock

Cluster 9 (length: 74): electric bass, french death metal, metal, instrumental djent, acid rock, palm desert scene, oxford indie, art rock, jazz metal, progressive metal, grunge, funk rock, blues rock, microtonal, nu metal, uk post-punk, jazz fusion, jam band, classic rock, groove metal, stoner metal, progressive groove metal, noise pop, instrumental rock, rap metal, progressive jazz fusion, glam rock, speed metal, album rock, modern rock, swedish metal, neo-psychedelic, post-rock, funk, garage rock, melodic thrash, hard rock, thrash metal, progressive death metal, dance pop, math rock, funk metal, jazz funk, sacramento indie, double drumming, indie rock, french metal, trip hop, permanent wave, shoegaze, p funk, swedish progressive metal, post-grunge, psychedelic rock, djent, alternative rock, instrumental funk, alternative metal, el paso indie, stoner rock, new wave, jazz, instrumental math rock, australian psych, progressive rock, mellow gold, contemporary jazz, jazz rock, old school thrash, soft rock, zolo, rock, symphonic rock, melancholia

In [28]:
df_exploded = track_clustered.explode("clusters")

for i in range(0, gmm_components):
    _df = df_exploded.loc[df_exploded["clusters"] == i, ["track_name", "artist", "track_url"]]
    elements = min(5, _df.shape[0])
    print(f"Cluster {i}")
    print(_df.sample(elements).values)
Cluster 0
[['Host' 'Muse' 'https://open.spotify.com/track/5BOMJwJhP84KsMLCqmTUYD']
 ['Jazz Hands of Doom' 'The Mercury Tree'
  'https://open.spotify.com/track/3DHRWhSqJN51BcHfKjBQL0']
 ['Eternal Life' 'Jeff Buckley'
  'https://open.spotify.com/track/7bf4nfz09yp6w7L7r9hQ1V']
 ['One More Red Nightmare' 'King Crimson'
  'https://open.spotify.com/track/2qANtYqqPUNY379QRLE0yl']
 ['Shrinking Universe' 'Muse'
  'https://open.spotify.com/track/4lUWIcM7hNaDeoIiD0NJSS']]
Cluster 1
[['Muscle Museum' 'Muse'
  'https://open.spotify.com/track/5rupf5kRDLhhFPxH15ZmBF']
 ['Harper Lewis' 'Russian Circles'
  'https://open.spotify.com/track/2iG7MCJdoM4NfvsK2AVgqY']
 ['Hourglass' 'Polyphia'
  'https://open.spotify.com/track/11ONhzEbaDcXQLygsPPWWN']
 ['The Shattered Fortress' 'Dream Theater'
  'https://open.spotify.com/track/0lphvKBt6vQbDAy0gUWk3C']
 ['Virtual Insanity - Remastered 2006' 'Jamiroquai'
  'https://open.spotify.com/track/47W6YR93MPCGLEUReLMyDm']]
Cluster 2
[['125th Street Congress' 'Weather Report'
  'https://open.spotify.com/track/7qJkuqvDhyD20D1JCK2Aqy']
 ['Mayonaise - 2011 Remaster' 'The Smashing Pumpkins'
  'https://open.spotify.com/track/0jmKzJmUEKNbC7eU8YfOiA']
 ['No Surprises' 'Radiohead'
  'https://open.spotify.com/track/10nyNJ6zNy2YVYLrcwLccB']
 ['Paranoid Android' 'Radiohead'
  'https://open.spotify.com/track/6LgJvl0Xdtc73RJ1mmpotq']
 ['The Gift of Guilt' 'Gojira'
  'https://open.spotify.com/track/5ke0lqEJZM2nAD3aRMxfV2']]
Cluster 3
[['Mushroom Men' 'Les Claypool'
  'https://open.spotify.com/track/1JyP5hSjws03QtKMAYg8O3']
 ['Meet Your Master' 'Nine Inch Nails'
  'https://open.spotify.com/track/3pDJBnCefNLS2oU4wpv1Rp']
 ['Deeper Underground - Full Version' 'Jamiroquai'
  'https://open.spotify.com/track/19x5x7F8SYMfWNiJOmqMUu']
 ['Blew' 'Nirvana'
  'https://open.spotify.com/track/7pETV41GUutaZ6KMHMAYIH']
 ['Karn Evil 9 1st Impression, Pt. 2 - 2014 Remaster'
  'Emerson, Lake & Palmer'
  'https://open.spotify.com/track/0nDQu5i6B93GvUJH8iJ0y9']]
Cluster 4
[['glimmer' 'Covet'
  'https://open.spotify.com/track/4kEhypsV7I5JAwHAesTKqR']
 ['No Quarter - 1990 Remaster' 'Led Zeppelin'
  'https://open.spotify.com/track/2fQ2iALVbAZ7MkH6PaaIJ6']
 ['Holy Wars...The Punishment Due - 2004 Remix' 'Megadeth'
  'https://open.spotify.com/track/5LyRtsQLhcXmy50VXhQXXS']
 ['G.O.A.T.' 'Polyphia'
  'https://open.spotify.com/track/0maCwhZTO3PybhSiQcsjAf']
 ["You've Seen the Butcher" 'Deftones'
  'https://open.spotify.com/track/0oHj2DHtNVWEgBqOa1bejc']]
Cluster 5
[['Of Reality - Palingenesis' 'TesseracT'
  'https://open.spotify.com/track/2SMWGgAT5C6IghiEiguY7I']
 ['Maggot Brain' 'Funkadelic'
  'https://open.spotify.com/track/3utrFnKNa1QSmUIm8QxHEC']
 ['Cassandra Gemini' 'The Mars Volta'
  'https://open.spotify.com/track/7niwGnlrDXwKtNteHbix2i']
 ['Cygnus...Vismund Cygnus' 'The Mars Volta'
  'https://open.spotify.com/track/2mmygsZnoEzJHXzEMgLd76']
 ['See Me' 'King Gizzard & The Lizard Wizard'
  'https://open.spotify.com/track/7LOgi68eugnunVQikONhTt']]
Cluster 6
[['Passenger' 'Deftones'
  'https://open.spotify.com/track/7IoK6jZBxY7NMoQPoPXZCF']
 ['The Cell' 'Gojira'
  'https://open.spotify.com/track/7nCD5l7GrFyt6o1mstCUFr']
 ['Plainsong - Remastered' 'The Cure'
  'https://open.spotify.com/track/4gcfxHL1iRgP0RHCDYMNIo']
 ['Child Of Vision - 2010 Remastered' 'Supertramp'
  'https://open.spotify.com/track/70OA31UNhLkTL0M8CXMwJi']
 ['Dialectic Chaos' 'Megadeth'
  'https://open.spotify.com/track/2Sl9U6mLbNeaE9lT9C32Td']]
Cluster 7
[['Cusp of Eternity' 'Opeth'
  'https://open.spotify.com/track/4pmqtt1M0hGwdyJocXyV1a']
 ['Flash Light' 'Parliament'
  'https://open.spotify.com/track/1v1PV2wERHiMPesMWX0qmO']
 ['The Fall' 'Gojira'
  'https://open.spotify.com/track/3ONrh7cP8vezCw1Q3UJNOn']
 ['Starship Trooper: a. Life Seeker, b. Disillusion, c. Würm' 'Yes'
  'https://open.spotify.com/track/1K75F5lMyhGFqbM8HwWSpS']
 ['Viscera Eyes' 'The Mars Volta'
  'https://open.spotify.com/track/6Ae1bfA16wez1qNfVzOyFb']]
Cluster 8
[['Dyslexicon' 'The Mars Volta'
  'https://open.spotify.com/track/07DiJLA3Qef8OT23D4qlWC']
 ['Zed and Two Naughts' 'The Mars Volta'
  'https://open.spotify.com/track/38CItr5N2JM2XFLLyWUVw0']]
Cluster 9
[['Chimera (feat. Lil West)' 'Polyphia'
  'https://open.spotify.com/track/5aVKIdM550lRzk7rFbPcF7']
 ['Teen Town' 'Weather Report'
  'https://open.spotify.com/track/4OzXE9NnSdD9aEAwBcnYBI']
 ['Nature Boy' 'Primus'
  'https://open.spotify.com/track/45z51HNfBAhKZ8D8tYBnxZ']
 ['Noctourniquet' 'The Mars Volta'
  'https://open.spotify.com/track/25Bf9H3ixSgWlicaLhiRFL']
 ['Mongoose Walk' 'Stanley Clarke'
  'https://open.spotify.com/track/5LrKQKJSPt8xiYRgfnHuXh']]

Results are similar to those of KMeans: clusters of very high size (some of more than half of classes), and seem very mixed, with a little bit of everything in almost all of them. It's not a good classification.

¶

Conclusions¶

It has been seen that clustering when there is a high amount of classes can be a very challenging. We know a priori that some genres are related (rock and its subgenres are all "rock" music), but trying to automatize the grouping without manually assigning labels to each genre has been a little problematic. This shows the difficulty in doing a ML algorithm with a subjective target (we know that, for example, indie and alternative music can have similarities, and they usually are very different from, for example, extreme metal or classical music), and with so much variety in the classes.

Reducing the dataset to include only the most represented genres, the results are somewhat ok with the semantic clustering, with groups of moderate size and sensible aggregations. Also, being a more subjective classification, the semantic transformer may include some subjective relationship between the words used in the genres, so this helps in clustering. This approach also shows that, in some cases, it's better (both faster and with better results) to use a pre-trained model that fits to the task than "reinventing the wheel" and trying to construct a complex model.

When using features to cluster, the results are way off, mainly because a song can have several genres, and not all of them may apply to the song, as it's an artist info; also, there is no quantification of "how much" a song pertains to a certain genre, so this can't be simplified in an easy way; and finally, it can be debated if a song or artist is of a certain genre according to the Spotify classification.